Activation Compression Method for Deep Learning Acceleration

ABSTRACT

A system and method for multiplying matrices, and method for training a convolutional neural network (CNN), are provided. The system includes a processor and a matrix multiply accelerator (MMA). The processor is configured to generate, based on an input tensor, a number of basic block matrices, each basic block matrix including a number of elements; for each basic block matrix: prune, based on a sparsity value, the elements of the basic block matrix, generate a mask for the basic block matrix, each mask including a number of bits, each bit corresponding to a different element of the basic block matrix, and compress the basic block matrix to generate a compressed basic block matrix having fewer elements than the basic block matrix. The MMA is configured to multiply, based on the masks, the compressed basic block matrices and a weight matrix to generate an output matrix.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Patent Application Ser. No. 63/117,728 (filed on Nov. 24, 2020), the disclosure of which is incorporated herein by reference in its entirety.

BACKGROUND

The present disclosure relates to computer systems. More particularly, the present disclosure relates to a matrix multiplication system and method.

Artificial neural networks (ANNs), such as deep neural networks (DNNs), convolutional neural networks (CNNs), etc., are a popular solution to a wide array of challenging classification, recognition and regression problems. However, many ANN models require a large number of calculations involving a large number of weights and activations, which presents a significant challenge with respect to access, storage and performance, particularly for mobile and other power or storage-constrained devices. An ANN hardware accelerator accelerates these calculations, such as, for example, convolution operations performed by CNNs.

Typically, native convolution operations are not performed by a CNN due to the complicated dataflow and expensive datapaths that are usually required. Instead, native convolution operations are converted into generic matrix multiplication (GEMM) operations, and then the GEMM operations are executed more efficiently by a central processing unit (CPU), specialized processor, hardware accelerator processing engine, etc., using optimized software libraries or specialized hardware. More particularly, an “IM2COL” software function may be used to convert the filter (weight) matrix and the input feature map (IFM) matrix for each convolution operation into an expanded format that is compatible with a GEMM operation. The IM2COL versions of each filter (weight) matrix and each IFM matrix are generated and stored in memory, and then loaded from memory and processed by the GEMM operation.
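
By way of illustration only, the following sketch (in Python with NumPy, using a hypothetical helper named im2col) shows how a multi-channel convolution may be recast as a single GEMM; it is a simplified example rather than the IM2COL routine of any particular library.

    import numpy as np

    def im2col(ifm, kh, kw):
        # ifm: (channels, height, width); kernel kh x kw, stride 1, no padding
        c, h, w = ifm.shape
        oh, ow = h - kh + 1, w - kw + 1
        cols = np.empty((c * kh * kw, oh * ow), dtype=ifm.dtype)
        for i in range(oh):
            for j in range(ow):
                patch = ifm[:, i:i + kh, j:j + kw]   # one receptive field, all channels
                cols[:, i * ow + j] = patch.ravel()  # flatten into one column
        return cols

    ifm = np.random.rand(4, 5, 5)            # four 5x5 input feature maps
    weights = np.random.rand(4, 4, 2, 2)     # four weight sets, each 4 channels of 2x2 weights
    w_mat = weights.reshape(4, -1)           # expanded (GEMM-ready) weight matrix, 4x16
    a_mat = im2col(ifm, 2, 2)                # expanded input feature map matrix, 16x16
    ofm = (w_mat @ a_mat).reshape(4, 4, 4)   # GEMM, then reform into output feature maps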

Generally, matrices may be classified as either sparse or dense. Most elements of a sparse matrix have a value of zero, while most elements of a dense matrix have a non-zero value. For the simple matrix multiplication operation C=A·B, when matrix A or matrix B is sparse, most of the matrix calculations will include a value of zero for at least one of the operands. When both matrix A and matrix B are sparse, an even greater number of matrix calculations will include a value of zero for at least one of the operands. Since multiplication by an operand that has a value of zero will always result in a product that has a value of zero, applying standard matrix multiplication techniques to sparse matrices is very inefficient due to the large number of operands that have a value of zero.
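
As a simple illustration (not part of the disclosed method), the following sketch counts how many of the scalar multiplications in a naive C=A·B have at least one zero-valued operand when matrix A is sparse:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.random((64, 64))
    A[rng.random((64, 64)) < 0.9] = 0.0      # matrix A, roughly 90% of elements are zero
    B = rng.random((64, 64))                 # dense matrix B

    total = A.shape[0] * A.shape[1] * B.shape[1]        # multiplications in naive C = A.B
    wasted = np.count_nonzero(A == 0.0) * B.shape[1]    # multiplications with a zero operand
    print(f"{wasted} of {total} multiplications have a zero operand")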

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 depicts an ANN, in accordance with embodiments of the present disclosure.

FIG. 2 depicts a CNN, in accordance with embodiments of the present disclosure.

FIG. 3A depicts convolutional layer calculation for a CNN, FIG. 3B depicts a converted convolutional layer calculation for a CNN, and FIG. 3C depicts a converted input data matrix, in accordance with an embodiment of the present disclosure.

FIG. 4 depicts a data flow diagram for a multiply-and-accumulate (MAC) array.

FIG. 5A depicts input feature maps, FIG. 5B depicts basic blocks, and FIG. 5C depicts basic block matrices, in accordance with an embodiment of the present disclosure.

FIG. 6A depicts a data flow diagram for a portion of a training process for CNN 15, and FIGS. 6B and 6C depict block diagrams of dynamic activation pruning (DAP) selection circuits, according to embodiments of the present disclosure.

FIG. 7A depicts basic block matrices after DAP selection, and FIG. 7B depicts compressed basic block matrices, in accordance with an embodiment of the present disclosure.

FIGS. 8A, 8B, 8C and 8D depict weight matrix re-sequencing and converted input data matrix re-sequencing and compression, in accordance with an embodiment of the present disclosure.

FIG. 9 depicts a data flow diagram for a DAP MAC array, in accordance with an embodiment of the present disclosure.

FIG. 10 depicts a block diagram of a system, in accordance with an embodiment of the present disclosure.

FIG. 11 depicts a block diagram of a matrix multiply accelerator (MMA), in accordance with an embodiment of the present disclosure.

FIG. 12 depicts a block diagram of a processing engine (PE) for an MMA, in accordance with an embodiment of the present disclosure.

FIG. 13 depicts a flow diagram representing functionality associated with multiplying matrices, in accordance with embodiments of the present disclosure.

DETAILED DESCRIPTION

Embodiments of the present disclosure will now be described with reference to the drawing figures, in which like reference numerals refer to like parts throughout.

Embodiments of the present disclosure advantageously provide a system and method for multiplying matrices that significantly reduce “multiply by zero” conditions. Embodiments of the present disclosure are applicable to the multiplication of a dense matrix with a “sparse” matrix, and many embodiments accommodate any degree of sparsity.

In one embodiment, a system includes a processor coupled to a memory, and a matrix multiply accelerator (MMA) coupled to the processor and the memory. The processor is configured to generate, based on an input tensor, a number of basic block matrices, each basic block matrix including a number of elements; for each basic block matrix: prune, based on a sparsity value, the elements of the basic block matrix, generate a mask for the basic block matrix, each mask including a number of bits, each bit corresponding to a different element of the basic block matrix, and compress the basic block matrix to generate a compressed basic block matrix having fewer elements than the basic block matrix. The MMA is configured to multiply, based on the masks, the compressed basic block matrices and a weight matrix to generate an output matrix.
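
By way of illustration only, the per-block prune/mask/compress flow summarized above may be sketched as follows (Python with NumPy); the 4×4 block size, the magnitude-based selection rule and the storage layout are assumptions made for this example, not limitations of the embodiments.

    import numpy as np

    def prune_mask_compress(block, sparsity):
        # block: basic block matrix; sparsity: fraction of elements to prune (e.g., 0.75)
        flat = block.ravel()
        keep = max(1, int(round(flat.size * (1.0 - sparsity))))
        kept_idx = np.argsort(np.abs(flat))[::-1][:keep]   # keep the largest-magnitude elements
        mask = np.zeros(flat.size, dtype=np.uint8)
        mask[kept_idx] = 1                                 # one bit per element of the block
        compressed = flat[np.sort(kept_idx)]               # kept elements, in original order
        return mask, compressed

    block = np.array([[0.1, -2.0, 0.0, 0.3],
                      [4.5,  0.2, 0.0, 0.0],
                      [0.0, -0.7, 1.1, 0.0],
                      [0.0,  0.0, 0.0, 3.3]])
    mask, compressed = prune_mask_compress(block, sparsity=0.75)
    # mask has 16 bits; compressed has 4 elements, fewer than the original 16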

An ANN models the relationships between input data or signals and output data or signals using a network of interconnected nodes that is trained through a learning process. The nodes are arranged into various layers, including, for example, an input layer, one or more hidden layers, and an output layer. The input layer receives input data, such as, for example, image data, and the output layer generates output data, such as, for example, a probability that the image data contains a known object. Each hidden layer provides at least a partial transformation of the input data to the output data. A DNN has multiple hidden layers in order to model complex, nonlinear relationships between input data and output data.

In a fully-connected, feedforward ANN, each node is connected to all of the nodes in the preceding layer, as well as to all of the nodes in the subsequent layer. For example, each input layer node is connected to each hidden layer node, each hidden layer node is connected to each input layer node and each output layer node, and each output layer node is connected to each hidden layer node. Additional hidden layers are similarly interconnected. Each connection has a weight value, and each node has an activation function, such as, for example, a linear function, a step function, a sigmoid function, a tanh function, a rectified linear unit (ReLU) function, etc., that determines the output of the node based on the weighted sum of the inputs to the node. The input data propagates from the input layer nodes, through respective connection weights to the hidden layer nodes, and then through respective connection weights to the output layer nodes.

More particularly, at each input node, input data is provided to the activation function for that node, and the output of the activation function is then provided as an input data value to each hidden layer node. At each hidden layer node, the input data value received from each input layer node is multiplied by a respective connection weight, and the resulting products are summed or accumulated into an activation value that is provided to the activation function for that node. The output of the activation function is then provided as an input data value to each output layer node. At each output layer node, the output data value received from each hidden layer node is multiplied by a respective connection weight, and the resulting products are summed or accumulated into an activation value that is provided to the activation function for that node. The output of the activation function is then provided as output data. Additional hidden layers may be similarly configured to process data.
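
For clarity, a minimal sketch of the forward propagation just described, for one hidden layer and one output layer (the dimensions, random weights and ReLU activation are illustrative choices):

    import numpy as np

    def relu(x):
        return np.maximum(x, 0.0)

    rng = np.random.default_rng(1)
    x = rng.random(3)            # input data values from 3 input layer nodes
    W1 = rng.random((5, 3))      # connection weights, input layer -> 5 hidden layer nodes
    W2 = rng.random((2, 5))      # connection weights, hidden layer -> 2 output layer nodes

    h = relu(W1 @ x)             # weighted sum at each hidden node, then activation
    y = relu(W2 @ h)             # weighted sum at each output node, then output data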

FIG. 1 depicts ANN 10, in accordance with an embodiment of the present disclosure.

ANN 10 includes input layer 20, one or more hidden layers 30, 40, 50, etc., and output layer 60. Input layer 20 includes one or more input nodes 21, 22, 23, etc. Hidden layer 30 includes one or more hidden nodes 31, 32, 33, 34, 35, etc. Hidden layer 40 includes one or more hidden nodes 41, 42, 43, 44, 45, etc. Hidden layer 50 includes one or more hidden nodes 51, 52, 53, 54, 55, etc. Output layer 60 includes one or more output nodes 61, 62, etc. Generally, ANN 10 includes N hidden layers, input layer 20 includes “i” nodes, hidden layer 30 includes “j” nodes, hidden layer 40 includes “k” nodes, hidden layer 50 includes “m” nodes, and output layer 60 includes “o” nodes.

In one embodiment, N equals 3, i equals 3, j, k and m equal 5 and o equals 2 (depicted in FIG. 1). Input node 21 is coupled to hidden nodes 31 to 35, input node 22 is coupled to hidden nodes 31 to 35, and input node 23 is coupled to hidden nodes 31 to 35. Hidden node 31 is coupled to hidden nodes 41 to 45, hidden node 32 is coupled to hidden nodes 41 to 45, hidden node 33 is coupled to hidden nodes 41 to 45, hidden node 34 is coupled to hidden nodes 41 to 45, and hidden node 35 is coupled to hidden nodes 41 to 45. Hidden node 41 is coupled to hidden nodes 51 to 55, hidden node 42 is coupled to hidden nodes 51 to 55, hidden node 43 is coupled to hidden nodes 51 to 55, hidden node 44 is coupled to hidden nodes 51 to 55, and hidden node 45 is coupled to hidden nodes 51 to 55. Hidden node 51 is coupled to output nodes 61 and 62, hidden node 52 is coupled to output nodes 61 and 62, hidden node 53 is coupled to output nodes 61 and 62, hidden node 54 is coupled to output nodes 61 and 62, and hidden node 55 is coupled to output nodes 61 and 62.

Many other variations of input, hidden and output layers are clearly possible, including hidden layers that are locally-connected, rather than fully-connected, to one another.

Training an ANN includes optimizing the connection weights between nodes by minimizing the prediction error of the output data until the ANN achieves a particular level of accuracy. One method is backpropagation, or backward propagation of errors, which iteratively and recursively determines a gradient descent with respect to the connection weights, and then adjusts the connection weights to improve the performance of the network.
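
A minimal sketch of one gradient-descent weight update for a single node with a squared-error loss (the learning rate and loss function are illustrative; the disclosure does not prescribe a particular training recipe):

    import numpy as np

    x = np.array([0.5, -1.0, 2.0])   # inputs to the node
    w = np.array([0.1, 0.4, -0.2])   # connection weights
    target, lr = 1.0, 0.05           # desired output and learning rate

    y = w @ x                        # node output (linear activation for simplicity)
    grad = (y - target) * x          # gradient of 0.5 * (y - target)**2 with respect to w
    w = w - lr * grad                # adjust the weights to reduce the prediction error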

A multi-layer perceptron (MLP) is a fully-connected ANN that has an input layer, an output layer and one or more hidden layers. MLPs may be used for natural language processing applications, such as machine translation, speech recognition, etc. Other ANNs include recurrent neural networks (RNNs), long short-term memories (LSTMs), sequence-to-sequence models that include an encoder RNN and a decoder RNN, shallow neural networks, etc.

A CNN is a variation of an MLP that may be used for classification or recognition applications, such as image recognition, speech recognition, etc. A CNN has an input layer, an output layer and multiple hidden layers including convolutional layers, pooling layers, normalization layers, fully-connected layers, etc. Each convolutional layer applies a sliding dot product or cross-correlation to an input volume, applies an activation function to the results, and then provides the activation or output volume to the next layer. Convolutional layers typically use the ReLU function as the activation function. In certain embodiments, the activation function is provided in a separate activation layer, such as, for example, a ReLU layer. A pooling layer reduces the dimensions of the output volume received from the preceding convolutional layer, and may calculate an average or a maximum over small clusters of data, such as, for example, 2×2 matrices. In certain embodiments, a convolutional layer and a pooling layer may form a single layer of a CNN. The fully-connected layers follow the convolutional and pooling layers, and include a flatten layer and a classification layer, followed by a normalization layer that includes a normalization function, such as the SoftMax function. The output layer follows the last fully-connected layer; in certain embodiments, the output layer may include the normalization function.
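
As an illustration of the pooling operation described above, a minimal sketch of 2×2 max pooling, which halves the height and width of a feature map (average pooling would take the mean of each cluster instead):

    import numpy as np

    def max_pool_2x2(fmap):
        # fmap: (height, width) with even dimensions; non-overlapping 2x2 clusters
        h, w = fmap.shape
        return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

    fmap = np.arange(16, dtype=np.float32).reshape(4, 4)
    pooled = max_pool_2x2(fmap)      # 2x2 output, maximum over each 2x2 cluster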

FIG. 2 depicts CNN 15, in accordance with an embodiment of the present disclosure. CNN 15 includes input layer 20, one or more hidden layers, such as convolutional layer 30-1, pooling layer 30-2, hidden (flatten) layer 40, hidden (classification) layer 50, etc., and output layer 60. Many other variations of input, hidden and output layers are contemplated.

Input layer 20 includes one or more input nodes 21, etc., that present the input data, such as a color image, as an input volume to the first convolutional layer, e.g., convolutional layer 30-1. The input volume is a three-dimensional matrix that has a width, a height and a depth. For example, input data that represent a color image are presented as an input volume that is 512 pixels×512 pixels×3 channels (red, green, blue); other input volume dimensions may also be used, such as 32×32×3, 64×64×3, 128×128×3, etc., 32×32×1, 64×64×1, 128×128×1, 512×512×1, etc.

Convolutional layer 30-1 is locally-connected to input layer 20, and includes a plurality of nodes that are connected to local regions in the input volume (not depicted for clarity). For a CNN that uses a standard convolution, each node computes a dot product between the node's weights and the respective local region of the input volume. An activation function is then applied to the results of each convolution calculation to produce an output volume that is provided as an input volume to the subsequent layer. The activation function may be applied by each convolutional layer node or by the nodes of a subsequent locally-connected ReLU layer.

Pooling layer 30-2 is locally-connected to convolutional layer 30-1, and includes a plurality of nodes that are connected to local regions in the input volume (not depicted for clarity). Pooling layer 30-2 also produces an output volume that is provided as the input volume to the subsequent layer, such as, for example, another convolutional layer 30-1, a flatten layer 40, etc. In certain embodiments, convolutional layer 30-1 and pooling layer 30-2 form a single hidden layer 30. Similarly, in certain embodiments, convolutional layer 30-1, a ReLU layer and pooling layer 30-2 form a single hidden layer 30. Generally, the output volumes of the convolutional and pooling layers may be described as feature maps, and one or more single hidden layers 30 form a feature learning portion of CNN 15.

Hidden layer 40 is a “flatten” layer that is locally-connected to pooling layer 30-2, and includes one or more hidden (flatten) nodes 41, 42, 43, 44, 45, etc. Hidden (flatten) layer 40 “flattens” the output volume produced by the preceding pooling layer 30-2 into a column vector, which is provided to the subsequent, fully-connected hidden layer 50.

Hidden layer 50 is a classification layer that is fully-connected to hidden (flatten) layer 40, and includes one or more hidden (classification) nodes 51, 52, 53, 54, 55, etc.

Output layer 60 includes one or more output nodes 61, 62, etc., and is fully-connected to hidden (classification) layer 50. Fully-connected output layer 60 receives the classification results output by hidden (classification) layer 50, and each node outputs a predicted class score. A normalization function, such as a Softmax function, may be applied to the predicted class scores by output layer 60, or, alternatively, by an additional layer interposed between hidden (classification) layer 50 and output layer 60.

Similar to ANNs, training a CNN includes optimizing the connection weights between nodes by minimizing the prediction error of the output data until the CNN achieves a particular level of accuracy. As noted above, backpropagation may be used to iteratively and recursively determine a gradient descent with respect to the connection weights, and then adjust the connection weights to improve the performance of the network. Matrix multiplication operations, and, more particularly, multiply-and-accumulate (MAC) operations, are used extensively by CNNs, as well as other ANNs.

FIG. 3A depicts convolutional layer calculation 200 for a CNN, in accordance with an embodiment of the present disclosure.

Input feature maps 204 include four channels and one input data matrix for each channel, i.e., input data matrices 204 ¹, 204 ², 204 ³ and 204 ⁴. Filter 202 includes four filter or weight sets 202 ¹, 202 ², 202 ³ and 202 ⁴, and each filter or weight set includes four weight matrices, one weight matrix for each channel. Output feature maps 206 include four channels and one output data matrix for each filter or weight set, i.e., output data matrices 206 ¹, 206 ², 206 ³ and 206 ⁴. Convolutional layer calculation 200 convolves filter 202 with input feature maps 204 to produce output feature maps 206.

Generally, input data matrices 204 ¹, 204 ², 204 ³ and 204 ⁴ form an input tensor, each weight set 202 ¹, 202 ², 202 ³ and 202 ⁴ forms a weight tensor, and output data matrices 206 ¹, 206 ², 206 ³ and 206 ⁴ form an output tensor. In this embodiment, each tensor has a height, a width and a depth. The depth of the input tensor is equal to the number of channels, the depth of each weight tensor is equal to the number of channels, and the depth of the output tensor is equal to the number of weight tensors (i.e., weight sets). While particular dimensions for the tensors and matrices have been selected for clarity of illustration and explanation, embodiments of the present disclosure are not so limited.
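
The tensor dimensions described above may be summarized with a short sketch (shapes only; the values are placeholders):

    import numpy as np

    input_tensor = np.zeros((4, 5, 5))       # depth 4 (channels), one 5x5 input data matrix each
    weight_tensors = np.zeros((4, 4, 2, 2))  # 4 weight sets, each 4 channels deep, 2x2 matrices
    output_tensor = np.zeros((4, 4, 4))      # depth 4 (one 4x4 output data matrix per weight set)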

In one embodiment, input data matrix 204 ¹ is a 5×5 matrix associated with the first channel and includes activations a¹ ₁, a¹ ₂, a¹ ₃, a¹ ₄, a¹ ₅, a¹ ₆, a¹ ₇, a¹ ₈, a¹ ₉, a¹ ₁₀, a¹ ₁₁, a¹ ₁₂, a¹ ₁₃, a¹ ₁₄, a¹ ₁₅, a¹ ₁₆, a¹ ₁₇, a¹ ₁₈, a¹ ₁₉, a¹ ₂₀, a¹ ₂₁, a¹ ₂₂, a¹ ₂₃, a¹ ₂₄ and a¹ ₂₅. Input data matrix 204 ² is a 5×5 matrix associated with the second channel and includes activations a² ₁, a² ₂, a² ₃, a² ₄, a² ₅, a² ₆, a² ₇, a² ₈, a² ₉, a² ₁₀, a² ₁₁, a² ₁₂, a² ₁₃, a² ₁₄, a² ₁₅, a² ₁₆, a² ₁₇, a² ₁₈, a² ₁₉, a² ₂₀, a² ₂₁, a² ₂₂, a² ₂₃, a² ₂₄ and a² ₂₅. Input data matrix 204 ³ is a 5×5 matrix associated with the third channel and includes activations a³ ₁, a³ ₂, a³ ₃, a³ ₄, a³ ₅, a³ ₆, a³ ₇, a³ ₈, a³ ₉, a³ ₁₀, a³ ₁₁, a³ ₁₂, a³ ₁₃, a³ ₁₄, a³ ₁₅, a³ ₁₆, a³ ₁₇, a³ ₁₈, a³ ₁₉, a³ ₂₀, a³ ₂₁, a³ ₂₂, a³ ₂₃, a³ ₂₄ and a³ ₂₅. Input data matrix 204 ⁴ is a 5×5 matrix associated with the fourth channel and includes activations a⁴ ₁, a⁴ ₂, a⁴ ₃, a⁴ ₄, a⁴ ₅, a⁴ ₆, a⁴ ₇, a⁴ ₈, a⁴ ₉, a⁴ ₁₀, a⁴ ₁₁, a⁴ ₁₂, a⁴ ₁₃, a⁴ ₁₄, a⁴ ₁₅, a⁴ ₁₆, a⁴ ₁₇, a⁴ ₁₈, a⁴ ₁₉, a⁴ ₂₀, a⁴ ₂₁, a⁴ ₂₂, a⁴ ₂₃, a⁴ ₂₄ and a⁴ ₂₅.

In this embodiment, weight set 202 ¹ includes four weight matrices. The first weight matrix is a 2×2 matrix associated with the first channel, and includes weights w¹ ₁, w¹ ₂, w¹ ₃ and w¹ ₄. The second weight matrix is a 2×2 matrix associated with the second channel, and includes weights w¹ ₅, w¹ ₆, w¹ ₇ and w¹ ₈. The third weight matrix is a 2×2 matrix associated with the third channel, and includes weights w¹ ₉, w¹ ₁₀, w¹ ₁₁ and w¹ ₁₂. The fourth weight matrix is a 2×2 matrix associated with the fourth channel, and includes weights w¹ ₁₃, w¹ ₁₄, w¹ ₁₅ and w¹ ₁₆.

Weight set 202 ² includes four weight matrices. The first weight matrix is a 2×2 matrix associated with the first channel, and includes weights w² ₁, w² ₂, w² ₃ and w² ₄. The second weight matrix is a 2×2 matrix associated with the second channel, and includes weights w² ₅, w² ₆, w² ₇ and w² ₈. The third weight matrix is a 2×2 matrix associated with the third channel, and includes weights w² ₉, w² ₁₀, w² ₁₁ and w² ₁₂. The fourth weight matrix is a 2×2 matrix associated with the fourth channel, and includes weights w² ₁₃, w² ₁₄, w² ₁₅ and w² ₁₆.

Weight set 202 ³ includes four weight matrices. The first weight matrix is a 2×2 matrix associated with the first channel, and includes weights w³ ₁, w³ ₂, w³ ₃ and w³ ₄. The second weight matrix is a 2×2 matrix associated with the second channel, and includes weights w³ ₅, w³ ₆, w³ ₇ and w³ ₈. The third weight matrix is a 2×2 matrix associated with the third channel, and includes weights w³ ₉, w³ ₁₀, w³ ₁₁ and w³ ₁₂. The fourth weight matrix is a 2×2 matrix associated with the fourth channel, and includes weights w³ ₁₃, w³ ₁₄, w³ ₁₅ and w³ ₁₆.

Weight set 202 ⁴ includes four weight matrices. The first weight matrix is a 2×2 matrix associated with the first channel, and includes weights w⁴ ₁, w⁴ ₂, w⁴ ₃ and w⁴ ₄. The second weight matrix is a 2×2 matrix associated with the second channel, and includes weights w⁴ ₅, w⁴ ₆, w⁴ ₇ and w⁴ ₈. The third weight matrix is a 2×2 matrix associated with the third channel, and includes weights w⁴ ₉, w⁴ ₁₀, w⁴ ₁₁ and w⁴ ₁₂. The fourth weight matrix is a 2×2 matrix associated with the fourth channel, and includes weights w⁴ ₁₃, w⁴ ₁₄, w⁴ ₁₅ and w⁴ ₁₆.

In this embodiment, output data matrix 206 ¹ is a 4×4 matrix associated with weight set 202 ¹ and includes activations o¹ ₁, o¹ ₂, o¹ ₃, o¹ ₄, o¹ ₅, o¹ ₆, o¹ ₇, o¹ ₈, o¹ ₉, o¹ ₁₀, o¹ ₁₁, o¹ ₁₂, o¹ ₁₃, o¹ ₁₄, o¹ ₁₅ and o¹ ₁₆. Output data matrix 206 ² is a 4×4 matrix associated with weight set 202 ² and includes activations o² ₁, o² ₂, o² ₃, o² ₄, o² ₅, o² ₆, o² ₇, o² ₈, o² ₉, o² ₁₀, o² ₁₁, o² ₁₂, o² ₁₃, o² ₁₄, o² ₁₅ and o² ₁₆. Output data matrix 206 ³ is a 4×4 matrix associated with weight set 202 ³ and includes activations o³ ₁, o³ ₂, o³ ₃, o³ ₄, o³ ₅, o³ ₆, o³ ₇, o³ ₈, o³ ₉, o³ ₁₀, o³ ₁₁, o³ ₁₂, o³ ₁₃, o³ ₁₄, o³ ₁₅ and o³ ₁₆. Output data matrix 206 ⁴ is a 4×4 matrix associated with weight set 202 ⁴ and includes activations o⁴ ₁, o⁴ ₂, o⁴ ₃, o⁴ ₄, o⁴ ₅, o⁴ ₆, o⁴ ₇, o⁴ ₈, o⁴ ₉, o⁴ ₁₀, o⁴ ₁₁, o⁴ ₁₂, o⁴ ₁₃, o⁴ ₁₄, o⁴ ₁₅ and o⁴ ₁₆.

For ease of explanation, each input data matrix 204 ¹, 204 ², 204 ³ and 204 ⁴ may be divided into four quadrants. The first quadrant spans the top (first) row and the second row, the second quadrant spans the second row and the third row, the third quadrant spans the third row and the fourth row, and the fourth quadrant spans the fourth row and the fifth (bottom) row. The first quadrant for input data matrix 204 ¹ (a¹ _(q1)), the first quadrant for input data matrix 204 ² (a² _(q1)), the first quadrant for input data matrix 204 ³ (a³ _(q1)), and the first quadrant for input data matrix 204 ⁴ (a⁴ _(q1)) are depicted; the remaining three quadrants for each input data matrix are not depicted for clarity.

First quadrant a¹ _(q1) includes elements a¹ ₁, a¹ ₂, a¹ ₃, a¹ ₄, a¹ ₅, a¹ ₆, a¹ ₇, a¹ ₈, a¹ ₉ and a¹ ₁₀, from which four blocks of elements are formed, i.e., a first block (a¹ ₁, a¹ ₂, a¹ ₆ and a¹ ₇), a second block (a¹ ₂, a¹ ₃, a¹ ₇ and a¹ ₈), a third block (a¹ ₃, a¹ ₄, a¹ ₈ and a¹ ₉), and a fourth block (a¹ ₄, a¹ ₅, a¹ ₉ and a¹ ₁₀). First quadrant a² _(q1) includes elements a² ₁, a² ₂, a² ₃, a² ₄, a² ₅, a² ₆, a² ₇, a² ₈, a² ₉ and a² ₁₀, from which four blocks of elements are formed, i.e., a first block (a² ₁, a² ₂, a² ₆ and a² ₇), a second block (a² ₂, a² ₃, a² ₇ and a² ₈), a third block (a² ₃, a² ₄, a² ₈ and a² ₉), and a fourth block (a² ₄, a² ₅, a² ₉ and a² ₁₀). First quadrant a³ _(q1) includes elements a³ ₁, a³ ₂, a³ ₃, a³ ₄, a³ ₅, a³ ₆, a³ ₇, a³ ₈, a³ ₉ and a³ ₁₀, from which four blocks of elements are formed, i.e., a first block (a³ ₁, a³ ₂, a³ ₆ and a³ ₇), a second block (a³ ₂, a³ ₃, a³ ₇ and a³ ₈), a third block (a³ ₃, a³ ₄, a³ ₈ and a³ ₉), and a fourth block (a³ ₄, a³ ₅, a³ ₉ and a³ ₁₀). First quadrant a⁴ _(q1) includes elements a⁴ ₁, a⁴ ₂, a⁴ ₃, a⁴ ₄, a⁴ ₅, a⁴ ₆, a⁴ ₇, a⁴ ₈, a⁴ ₉ and a⁴ ₁₀, from which four blocks of elements are formed, i.e., a first block (a⁴ ₁, a⁴ ₂, a⁴ ₆ and a⁴ ₇), a second block (a⁴ ₂, a⁴ ₃, a⁴ ₇ and a⁴ ₈), a third block (a⁴ ₃, a⁴ ₄, a⁴ ₈ and a⁴ ₉), and a fourth block (a⁴ ₄, a⁴ ₅, a⁴ ₉ and a⁴ ₁₀).

Second quadrant a¹ _(q2) includes elements a¹ ₆, a¹ ₇, a¹ ₈, a¹ ₉, a¹ ₁₀, a¹ ₁₁, a¹ ₁₂, a¹ ₁₃, a¹ ₁₄ and a¹ ₁₅, from which four blocks of elements are formed, i.e., a first block (a¹ ₆, a¹ ₇, a¹ ₁₁ and a¹ ₁₂), a second block (a¹ ₇, a¹ ₈, a¹ ₁₂ and a¹ ₁₃), a third block (a¹ ₈, a¹ ₉, a¹ ₁₃ and a¹ ₁₄), and a fourth block (a¹ ₉, a¹ ₁₀, a¹ ₁₄ and a¹ ₁₅). Second quadrant a² _(q2) includes elements a² ₆, a² ₇, a² ₈, a² ₉, a² ₁₀, a² ₁₁, a² ₁₂, a² ₁₃, a² ₁₄ and a² ₁₅, from which four blocks of elements are formed, i.e., a first block (a² ₆, a² ₇, a² ₁₁ and a² ₁₂), a second block (a² ₇, a² ₈, a² ₁₂ and a² ₁₃), a third block (a² ₈, a² ₉, a² ₁₃ and a² ₁₄), and a fourth block (a² ₉, a² ₁₀, a² ₁₄ and a² ₁₅). Second quadrant a³ _(q2) includes elements a³ ₆, a³ ₇, a³ ₈, a³ ₉, a³ ₁₀, a³ ₁₁, a³ ₁₂, a³ ₁₃, a³ ₁₄ and a³ ₁₅, from which four blocks of elements are formed, i.e., a first block (a³ ₆, a³ ₇, a³ ₁₁ and a³ ₁₂), a second block (a³ ₇, a³ ₈, a³ ₁₂ and a³ ₁₃), a third block (a³ ₈, a³ ₉, a³ ₁₃ and a³ ₁₄), and a fourth block (a³ ₉, a³ ₁₀, a³ ₁₄ and a³ ₁₅). Second quadrant a⁴ _(q2) includes elements a⁴ ₆, a⁴ ₇, a⁴ ₈, a⁴ ₉, a⁴ ₁₀, a⁴ ₁₁, a⁴ ₁₂, a⁴ ₁₃, a⁴ ₁₄ and a⁴ ₁₅, from which four blocks of elements are formed, i.e., a first block (a⁴ ₆, a⁴ ₇, a⁴ ₁₁ and a⁴ ₁₂), a second block (a⁴ ₇, a⁴ ₈, a⁴ ₁₂ and a⁴ ₁₃), a third block (a⁴ ₈, a⁴ ₉, a⁴ ₁₃ and a⁴ ₁₄), and a fourth block (a⁴ ₉, a⁴ ₁₀, a⁴ ₁₄ and a⁴ ₁₅).

Third quadrant a¹ _(q3) includes elements a¹ ₁₁, a¹ ₁₂, a¹ ₁₃, a¹ ₁₄, a¹ ₁₅, a¹ ₁₆, a¹ ₁₇, a¹ ₁₈, a¹ ₁₉ and a¹ ₂₀, from which four blocks of elements are formed, i.e., a first block (a¹ ₁₁, a¹ ₁₂, a¹ ₁₆ and a¹ ₁₇), a second block (a¹ ₁₂, a¹ ₁₃, a¹ ₁₇ and a¹ ₁₈), a third block (a¹ ₁₃, a¹ ₁₄, a¹ ₁₈ and a¹ ₁₉), and a fourth block (a¹ ₁₄, a¹ ₁₅, a¹ ₁₉ and a¹ ₂₀). Third quadrant a² _(q3) includes elements a² ₁₁, a² ₁₂, a² ₁₃, a² ₁₄, a² ₁₅, a² ₁₆, a² ₁₇, a² ₁₈, a² ₁₉ and a² ₂₀, from which four blocks of elements are formed, i.e., a first block (a² ₁₁, a² ₁₂, a² ₁₆ and a² ₁₇), a second block (a² ₁₂, a² ₁₃, a² ₁₇ and a² ₁₈), a third block (a² ₁₃, a² ₁₄, a² ₁₈ and a² ₁₉), and a fourth block (a² ₁₄, a² ₁₅, a² ₁₉ and a² ₂₀). Third quadrant a³ _(q3) includes elements a³ ₁₁, a³ ₁₂, a³ ₁₃, a³ ₁₄, a³ ₁₅, a³ ₁₆, a³ ₁₇, a³ ₁₈, a³ ₁₉ and a³ ₂₀, from which four blocks of elements are formed, i.e., a first block (a³ ₁₁, a³ ₁₂, a³ ₁₆ and a³ ₁₇), a second block (a³ ₁₂, a³ ₁₃, a³ ₁₇ and a³ ₁₈), a third block (a³ ₁₃, a³ ₁₄, a³ ₁₈ and a³ ₁₉), and a fourth block (a³ ₁₄, a³ ₁₅, a³ ₁₉ and a³ ₂₀). Third quadrant a⁴ _(q3) includes elements a⁴ ₁₁, a⁴ ₁₂, a⁴ ₁₃, a⁴ ₁₄, a⁴ ₁₅, a⁴ ₁₆, a⁴ ₁₇, a⁴ ₁₈, a⁴ ₁₉ and a⁴ ₂₀, from which four blocks of elements are formed, i.e., a first block (a⁴ ₁₁, a⁴ ₁₂, a⁴ ₁₆ and a⁴ ₁₇), a second block (a⁴ ₁₂, a⁴ ₁₃, a⁴ ₁₇ and a⁴ ₁₈), a third block (a⁴ ₁₃, a⁴ ₁₄, a⁴ ₁₈ and a⁴ ₁₉), and a fourth block (a⁴ ₁₄, a⁴ ₁₅, a⁴ ₁₉ and a⁴ ₂₀).

Fourth quadrant a¹ _(q4) includes elements a¹ ₁₆, a¹ ₁₇, a¹ ₁₈, a¹ ₁₉, a¹ ₂₀, a¹ ₂₁, a¹ ₂₂, a¹ ₂₃, a¹ ₂₄ and a¹ ₂₅, from which four blocks of elements are formed, i.e., a first block (a¹ ₁₆, a¹ ₁₇, a¹ ₂₁ and a¹ ₂₂), a second block (a¹ ₁₇, a¹ ₁₈, a¹ ₂₂ and a¹ ₂₃), a third block (a¹ ₁₈, a¹ ₁₉, a¹ ₂₃ and a¹ ₂₄), and a fourth block (a¹ ₁₉, a¹ ₂₀, a¹ ₂₄ and a¹ ₂₅). Fourth quadrant a² _(q4) includes elements a² ₁₆, a² ₁₇, a² ₁₈, a² ₁₉, a² ₂₀, a² ₂₁, a² ₂₂, a² ₂₃, a² ₂₄ and a² ₂₅, from which four blocks of elements are formed, i.e., a first block (a² ₁₆, a² ₁₇, a² ₂₁ and a² ₂₂), a second block (a² ₁₇, a² ₁₈, a² ₂₂ and a² ₂₃), a third block (a² ₁₈, a² ₁₉, a² ₂₃ and a² ₂₄), and a fourth block (a² ₁₉, a² ₂₀, a² ₂₄ and a² ₂₅). Fourth quadrant a³ _(q4) includes elements a³ ₁₆, a³ ₁₇, a³ ₁₈, a³ ₁₉, a³ ₂₀, a³ ₂₁, a³ ₂₂, a³ ₂₃, a³ ₂₄ and a³ ₂₅, from which four blocks of elements are formed, i.e., a first block (a³ ₁₆, a³ ₁₇, a³ ₂₁ and a³ ₂₂), a second block (a³ ₁₇, a³ ₁₈, a³ ₂₂ and a³ ₂₃), a third block (a³ ₁₈, a³ ₁₉, a³ ₂₃ and a³ ₂₄), and a fourth block (a³ ₁₉, a³ ₂₀, a³ ₂₄ and a³ ₂₅). Fourth quadrant a⁴ _(q4) includes elements a⁴ ₁₆, a⁴ ₁₇, a⁴ ₁₈, a⁴ ₁₉, a⁴ ₂₀, a⁴ ₂₁, a⁴ ₂₂, a⁴ ₂₃, a⁴ ₂₄ and a⁴ ₂₅, from which four blocks of elements are formed, i.e., a first block (a⁴ ₁₆, a⁴ ₁₇, a⁴ ₂₁ and a⁴ ₂₂), a second block (a⁴ ₁₇, a⁴ ₁₈, a⁴ ₂₂ and a⁴ ₂₃), a third block (a⁴ ₁₈, a⁴ ₁₉, a⁴ ₂₃ and a⁴ ₂₄), and a fourth block (a⁴ ₁₉, a⁴ ₂₀, a⁴ ₂₄ and a⁴ ₂₅).

Output feature maps 206 may also be divided into four quadrants; in this case, each quadrant spans all four output data matrices 206 ¹, 206 ², 206 ³ and 206 ⁴. The first quadrant spans the top (first) row of each output data matrix, the second quadrant spans the second row of each output data matrix, the third quadrant spans the third row of each output data matrix, and the fourth quadrant spans the fourth (bottom) row of each output data matrix. The first quadrant for output feature maps 206 (o_(q1)) is depicted; the remaining three quadrants are not depicted for clarity.

First quadrant o_(q1) includes o¹ ₁, o¹ ₂, o¹ ₃, o¹ ₄, o² ₁, o² ₂, o² ₃, o² ₄, o³ ₁, o³ ₂, o³ ₃, o³ ₄, o⁴ ₁, o⁴ ₂, o⁴ ₃ and o⁴ ₄. Second quadrant o_(q2) includes o¹ ₅, o¹ ₆, o¹ ₇, o¹ ₈, o² ₅, o² ₆, o² ₇, o² ₈, o³ ₅, o³ ₆, o³ ₇, o³ ₈, o⁴ ₅, o⁴ ₆, o⁴ ₇ and o⁴ ₈. Third quadrant o_(q3) includes o¹ ₉, o¹ ₁₀, o¹ ₁₁, o¹ ₁₂, o² ₉, o² ₁₀, o² ₁₁, o² ₁₂, o³ ₉, o³ ₁₀, o³ ₁₁, o³ ₁₂, o⁴ ₉, o⁴ ₁₀, o⁴ ₁₁ and o⁴ ₁₂. Fourth quadrant o_(q4) includes o¹ ₁₃, o¹ ₁₄, o¹ ₁₅, o¹ ₁₆, o² ₁₃, o² ₁₄, o² ₁₅, o² ₁₆, o³ ₁₃, o³ ₁₄, o³ ₁₅, o³ ₁₆, o⁴ ₁₃, o⁴ ₁₄, o⁴ ₁₅ and o⁴ ₁₆.

Generally, each output element within output data matrices 206 ¹, 206 ², 206 ³ and 206 ⁴ is the sum of the dot products of one of the weight sets 202 ¹, 202 ², 202 ³ and 202 ⁴ and a block of activation elements within a particular quadrant of input data matrices 204 ¹, 204 ², 204 ³ and 204 ⁴.

The calculation of the output elements in quadrant o_(q1) follows.

Output element o¹ ₁ of output data matrix 206 ¹ is the sum of the dot products of weight set 202 ¹ and the first block of activation elements within first quadrants a¹ _(q1), a² _(q1), a³ _(q1) and a⁴ _(q1) of input data matrices 204 ¹, 204 ², 204 ³ and 204 ⁴, respectively. The first block of activation elements within first quadrants a¹ _(q1), a² _(q1), a³ _(q1) and a⁴ _(q1) includes a¹ ₁, a¹ ₂, a¹ ₆ and a¹ ₇; a² ₁, a² ₂, a² ₆ and a² ₇; a³ ₁, a³ ₂, a³ ₆ and a³ ₇; and a⁴ ₁, a⁴ ₂, a⁴ ₆ and a⁴ ₇, respectively.

More particularly, the following dot products are summed to generate output element o¹ ₁: the dot product of the first weight matrix of weight set 202 ¹ and the first block of quadrant a¹ _(q1) (i.e., w¹ ₁·a¹ ₁+w¹ ₂·a¹ ₂+w¹ ₃·a¹ ₆+w¹ ₄·a¹ ₇), the dot product of the second weight matrix of weight set 202 ¹ and the first block of quadrant a² _(q1) (i.e., w¹ ₅·a² ₁+w¹ ₆·a² ₂+w¹ ₇·a² ₆+w¹ ₈·a² ₇), the dot product of the third weight matrix of weight set 202 ¹ and the first block of quadrant a³ _(q1) (i.e., w¹ ₉·a³ ₁+w¹ ₁₀·a³ ₂+w¹ ₁₁·a³ ₆+w¹ ₁₂·a³ ₇), and the dot product of the fourth weight matrix of weight set 202 ¹ and the first block of quadrant a⁴ _(q1) (i.e., w¹ ₁₃·a⁴ ₁+w¹ ₁₄·a⁴ ₂+w¹ ₁₅·a⁴ ₆+w¹ ₁₆·a⁴ ₇).
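
The four dot products that form output element o¹ ₁ can be checked numerically with a short sketch (random values stand in for the weights of weight set 202 ¹ and the first blocks of quadrants a¹ _(q1) through a⁴ _(q1)):

    import numpy as np

    rng = np.random.default_rng(2)
    w = rng.random((4, 2, 2))    # weight set 202^1: one 2x2 weight matrix per channel
    a = rng.random((4, 2, 2))    # first block of each first quadrant, one 2x2 block per channel

    # o1_1 is the sum over the four channels of the per-channel 2x2 dot products
    o1_1 = sum(np.sum(w[c] * a[c]) for c in range(4))
    assert np.isclose(o1_1, np.sum(w * a))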

Similarly, output element o² ₁ of output data matrix 206 ² is the sum of the dot products of weight set 202 ² and the first block of activation elements within first quadrants a¹ _(q1), a² _(q1), a³ _(q1) and a⁴ _(q1) of input data matrices 204 ¹, 204 ², 204 ³ and 204 ⁴, respectively. Output element o³ ₁ of output data matrix 206 ³ is the sum of the dot products of weight set 202 ³ and the first block of activation elements within first quadrants a¹ _(q1), a² _(q1), a³ _(q1) and a⁴ _(q1) of input data matrices 204 ¹, 204 ², 204 ³ and 204 ⁴, respectively. And, output element o⁴ ₁ of output data matrix 206 ⁴ is the sum of the dot products of weight set 202 ⁴ and the first block of activation elements within first quadrants a¹ _(q1), a² _(q1), a³ _(q1) and a⁴ _(q1) of input data matrices 204 ¹, 204 ², 204 ³ and 204 ⁴, respectively.

Output element o¹ ₂ of output data matrix 206 ¹ is the sum of the dot products of weight set 202 ¹ and the second block of activation elements within the first quadrants a¹ _(q1), a² _(q1), a³ _(q1) and a⁴ _(q1) of input data matrices 204 ¹, 204 ², 204 ³ and 204 ⁴, respectively. The second block of activation elements within the first quadrants a¹ _(q1), a² _(q1), a³ _(q1) and a⁴ _(q1) includes a¹ ₂, a¹ ₃, a¹ ₇ and a¹ ₈; a² ₂, a² ₃, a² ₇ and a² ₈; a³ ₂, a³ ₃, a³ ₇ and a³ ₈; and a⁴ ₂, a⁴ ₃, a⁴ ₇ and a⁴ ₈, respectively.

More particularly, the following dot products are summed to generate output element o¹ ₂: the dot product of the first weight matrix of weight set 202 ¹ and the second block of quadrant a¹ _(q1) (i.e., w¹ ₁·a¹ ₂+w¹ ₂·a¹ ₃+w¹ ₃·a¹ ₇+w¹ ₄·a¹ ₈), the dot product of the second weight matrix of weight set 202 ¹ and the second block of quadrant a² _(q1) (i.e., w¹ ₅·a² ₂+w¹ ₆·a² ₃+w¹ ₇·a² ₇+w¹ ₈·a² ₈), the dot product of the third weight matrix of weight set 202 ¹ and the second block of quadrant a³ _(q1) (i.e., w¹ ₉·a³ ₂+w¹ ₁₀·a³ ₃+w¹ ₁₁·a³ ₇+w¹ ₁₂·a³ ₈), and the dot product of the fourth weight matrix of weight set 202 ¹ and the second block of quadrant a⁴ _(q1) (i.e., w¹ ₁₃·a⁴ ₂+w¹ ₁₄·a⁴ ₃+w¹ ₁₅·a⁴ ₇+w¹ ₁₆·a⁴ ₈).

Similarly, output element o² ₂ of output data matrix 206 ² is the sum of the dot products of weight set 202 ² and the second block of activation elements within first quadrants a¹ _(q1), a² _(q1), a³ _(q1) and a⁴ _(q1) of input data matrices 204 ¹, 204 ², 204 ³ and 204 ⁴, respectively. Output element o³ ₂ of output data matrix 206 ³ is the sum of the dot products of weight set 202 ³ and the second block of activation elements within first quadrants a¹ _(q1), a² _(q1), a³ _(q1) and a⁴ _(q1) of input data matrices 204 ¹, 204 ², 204 ³ and 204 ⁴, respectively. And, output element o⁴ ₂ of output data matrix 206 ⁴ is the sum of the dot products of weight set 202 ⁴ and the second block of activation elements within first quadrants a¹ _(q1), a² _(q1), a³ _(q1) and a⁴ _(q1) of input data matrices 204 ¹, 204 ², 204 ³ and 204 ⁴, respectively.

And so on for output elements o¹ ₃ and o¹ ₄, o² ₃ and o² ₄, o³ ₃ and o³ ₄, and o⁴ ₃ and o⁴ ₄ of the first rows of output data matrices 206 ¹, 206 ², 206 ³ and 206 ⁴.

With respect to quadrant o_(q2), output element o¹ ₅ of output data matrix 206 ¹ is the sum of the dot products of weight set 202 ¹ and the first block of activation elements within second quadrants a¹ _(q2), a² _(q2), a³ _(q2) and a⁴ _(q2) of input data matrices 204 ¹, 204 ², 204 ³ and 204 ⁴, respectively. Output element o² ₅ of output data matrix 206 ² is the sum of the dot products of weight set 202 ² and the first block of activation elements within second quadrants a¹ _(q2), a² _(q2), a³ _(q2) and a⁴ _(q2) of input data matrices 204 ¹, 204 ², 204 ³ and 204 ⁴, respectively. Output element o³ ₅ of output data matrix 206 ³ is the sum of the dot products of weight set 202 ³ and the first block of activation elements within second quadrants a¹ _(q2), a² _(q2), a³ _(q2) and a⁴ _(q2) of input data matrices 204 ¹, 204 ², 204 ³ and 204 ⁴, respectively. And, output element o⁴ ₅ of output data matrix 206 ⁴ is the sum of the dot products of weight set 202 ⁴ and the first block of activation elements within second quadrants a¹ _(q2), a² _(q2), a³ _(q2) and a⁴ _(q2) of input data matrices 204 ¹, 204 ², 204 ³ and 204 ⁴, respectively. And so on for output elements o¹ ₆, o¹ ₇ and o¹ ₈, o² ₆, o² ₇ and o² ₈, o³ ₆, o³ ₇ and o³ ₈, and o⁴ ₆, o⁴ ₇ and o⁴ ₈ of the second rows of output data matrices 206 ¹, 206 ², 206 ³ and 206 ⁴.

With respect to quadrant o_(q3), output element o¹ ₉ of output data matrix 206 ¹ is the sum of the dot products of weight set 202 ¹ and the first block of activation elements within third quadrants a¹ _(q3), a² _(q3), a³ _(q3) and a⁴ _(q3) of input data matrices 204 ¹, 204 ², 204 ³ and 204 ⁴, respectively. Output element o² ₉ of output data matrix 206 ² is the sum of the dot products of weight set 202 ² and the first block of activation elements within third quadrants a¹ _(q3), a² _(q3), a³ _(q3) and a⁴ _(q3) of input data matrices 204 ¹, 204 ², 204 ³ and 204 ⁴, respectively. Output element o³ ₉ of output data matrix 206 ³ is the sum of the dot products of weight set 202 ³ and the first block of activation elements within third quadrants a¹ _(q3), a² _(q3), a³ _(q3) and a⁴ _(q3) of input data matrices 204 ¹, 204 ², 204 ³ and 204 ⁴, respectively. And, output element o⁴ ₉ of output data matrix 206 ⁴ is the sum of the dot products of weight set 202 ⁴ and the first block of activation elements within third quadrants a¹ _(q3), a² _(q3), a³ _(q3) and a⁴ _(q3) of input data matrices 204 ¹, 204 ², 204 ³ and 204 ⁴, respectively. And so on for output elements o¹ ₁₀, o¹ ₁₁ and o¹ ₁₂, o² ₁₀, o² ₁₁ and o² ₁₂, o³ ₁₀, o³ ₁₁ and o³ ₁₂, and o⁴ ₁₀, o⁴ ₁₁ and o⁴ ₁₂ of the third rows of output data matrices 206 ¹, 206 ², 206 ³ and 206 ⁴.

With respect to quadrant o_(q4), output element o¹ ₁₃ of output data matrix 206 ¹ is the sum of the dot products of weight set 202 ¹ and the first block of activation elements within fourth quadrants a¹ _(q4), a² _(q4), a³ _(q4) and a⁴ _(q4) of input data matrices 204 ¹, 204 ², 204 ³ and 204 ⁴, respectively. Output element o² ₁₃ of output data matrix 206 ² is the sum of the dot products of weight set 202 ² and the first block of activation elements within fourth quadrants a¹ _(q4), a² _(q4), a³ _(q4) and a⁴ _(q4) of input data matrices 204 ¹, 204 ², 204 ³ and 204 ⁴, respectively. Output element o³ ₁₃ of output data matrix 206 ³ is the sum of the dot products of weight set 202 ³ and the first block of activation elements within fourth quadrants a¹ _(q4), a² _(q4), a³ _(q4) and a⁴ _(q4) of input data matrices 204 ¹, 204 ², 204 ³ and 204 ⁴, respectively. And, output element o⁴ ₁₃ of output data matrix 206 ⁴ is the sum of the dot products of weight set 202 ⁴ and the first block of activation elements within fourth quadrants a¹ _(q4), a² _(q4), a³ _(q4) and a⁴ _(q4) of input data matrices 204 ¹, 204 ², 204 ³ and 204 ⁴, respectively. And so on for output elements o¹ ₁₄, o¹ ₁₅ and o¹ ₁₆, o² ₁₄, o² ₁₅ and o² ₁₆, o³ ₁₄, o³ ₁₅ and o³ ₁₆, and o⁴ ₁₄, o⁴ ₁₅ and o⁴ ₁₆ of the fourth rows of output data matrices 206 ¹, 206 ², 206 ³ and 206 ⁴.

FIG. 3B depicts converted convolutional layer calculation 210 for a CNN, while FIG. 3C depicts converted input data matrix 214, in accordance with an embodiment of the present disclosure.

In one embodiment, the convolutional layer calculations for CNNs executing on central processor units (CPUs) may be converted into generic matrix multiplication (GEMM) operations, which may leverage GEMM-optimized software libraries. Convolutional layer calculation 200 is converted into a GEMM operation by converting filter 202 into converted weight matrix 212, converting input feature maps 204 into converted input data matrix 214, and then multiplying converted weight matrix 212 and converted input data matrix 214 to generate converted output data matrix 216. Because simple matrix multiplication is performed rather than a convolution operation, each output element within converted output data matrix 216 is the dot product of one row of converted weight matrix 212 and one column of converted input data matrix 214. Converted output data matrix 216 is then reformed into output feature maps 206.
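
The equivalence described above can be checked with a short sketch (Python with NumPy, random values): the direct convolution of filter 202 with input feature maps 204 matches the product of converted weight matrix 212 and converted input data matrix 214, reformed into the output feature maps.

    import numpy as np

    rng = np.random.default_rng(3)
    ifm = rng.random((4, 5, 5))           # input feature maps 204 (4 channels, 5x5)
    weights = rng.random((4, 4, 2, 2))    # filter 202 (4 weight sets, 4 channels, 2x2)

    # Direct convolution (stride 1, no padding) -> four 4x4 output data matrices
    direct = np.zeros((4, 4, 4))
    for f in range(4):
        for i in range(4):
            for j in range(4):
                direct[f, i, j] = np.sum(weights[f] * ifm[:, i:i + 2, j:j + 2])

    # Converted (GEMM) form: 4x16 weight matrix times 16x16 input data matrix
    w_mat = weights.reshape(4, 16)
    a_mat = np.stack([ifm[:, i:i + 2, j:j + 2].ravel()
                      for i in range(4) for j in range(4)], axis=1)
    gemm = (w_mat @ a_mat).reshape(4, 4, 4)

    assert np.allclose(direct, gemm)      # GEMM result equals the native convolution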

Converted weight matrix 212 is a 4×16 matrix, and includes converted weight sets 212 ¹, 212 ², 212 ³ and 212 ⁴. Weight set 202 ¹ is flattened to form converted weight set 212 ¹, i.e., the first row, and includes weights w¹ ₁, w¹ ₂, w¹ ₃, w¹ ₄, w¹ ₅, w¹ ₆, w¹ ₇, w¹ ₈, w¹ ₉, w¹ ₁₀, w¹ ₁₁, w¹ ₁₂, w¹ ₁₃, w¹ ₁₄, w¹ ₁₅ and w¹ ₁₆. Weight set 202 ² is flattened to form converted weight set 212 ², i.e., the second row, and includes weights w² ₁, w² ₂, w² ₃, w² ₄, w² ₅, w² ₆, w² ₇, w² ₈, w² ₉, w² ₁₀, w² ₁₁, w² ₁₂, w² ₁₃, w² ₁₄, w² ₁₅ and w² ₁₆. Weight set 202 ³ is flattened to form converted weight set 212 ³, i.e., the third row, and includes weights w³ ₁, w³ ₂, w³ ₃, w³ ₄, w³ ₅, w³ ₆, w³ ₇, w³ ₈, w³ ₉, w³ ₁₀, w³ ₁₁, w³ ₁₂, w³ ₁₃, w³ ₁₄, w³ ₁₅ and w³ ₁₆. And, weight set 202 ⁴ is flattened to form converted weight set 212 ⁴, i.e., the fourth row, and includes weights w⁴ ₁, w⁴ ₂, w⁴ ₃, w⁴ ₄, w⁴ ₅, w⁴ ₆, w⁴ ₇, w⁴ ₈, w⁴ ₉, w⁴ ₁₀, w⁴ ₁₁, w⁴ ₁₂, w⁴ ₁₃, w⁴ ₁₄, w⁴ ₁₅ and w⁴ ₁₆.

Converted input data matrix 214 is a 16×16 matrix, and includes the blocks of each quadrant of input data matrices 204 ¹, 204 ², 204 ³ and 204 ⁴, i.e., quadrants a¹ _(q1), a¹ _(q2), a¹ _(q3), a¹ _(q4), a² _(q1), a² _(q2), a² _(q3), a² _(q4), a³ _(q1), a³ _(q2), a³ _(q3), a³ _(q4), a⁴ _(q1), a⁴ _(q2), a⁴ _(q3) and a⁴ _(q4), respectively. Generally, each block is flattened to form a portion of a single column of converted input data matrix 214.

More particularly, the first column of converted input data matrix 214 includes the first blocks from quadrants a¹ _(q1), a² _(q1), a³ _(q1) and a⁴ _(q1), i.e., activations a¹ ₁, a¹ ₂, a¹ ₆, a¹ ₇, a² ₁, a² ₂, a² ₆, a² ₇, a³ ₁, a³ ₂, a³ ₆, a³ ₇, a⁴ ₁, a⁴ ₂, a⁴ ₆ and a⁴ ₇. The second column of converted input data matrix 214 includes the second blocks from quadrants a¹ _(q1), a² _(q1), a³ _(q1) and a⁴ _(q1), i.e., activations a¹ ₂, a¹ ₃, a¹ ₇, a¹ ₈, a² ₂, a² ₃, a² ₇, a² ₈, a³ ₂, a³ ₃, a³ ₇, a³ ₈, a⁴ ₂, a⁴ ₃, a⁴ ₇ and a⁴ ₈. The third column of converted input data matrix 214 includes the third blocks from quadrants a¹ _(q1), a² _(q1), a³ _(q1) and a⁴ _(q1), i.e., activations a¹ ₃, a¹ ₄, a¹ ₈, a¹ ₉, a² ₃, a² ₄, a² ₈, a² ₉, a³ ₃, a³ ₄, a³ ₈, a³ ₉, a⁴ ₃, a⁴ ₄, a⁴ ₈ and a⁴ ₉. And, the fourth column of converted input data matrix 214 includes the fourth blocks from quadrants a¹ _(q1), a² _(q1), a³ _(q1) and a⁴ _(q1), i.e., activations a¹ ₄, a¹ ₅, a¹ ₉, a¹ ₁₀, a² ₄, a² ₅, a² ₉, a² ₁₀, a³ ₄, a³ ₅, a³ ₉, a³ ₁₀, a⁴ ₄, a⁴ ₅, a⁴ ₉ and a⁴ ₁₀.

The remaining columns of converted input data matrix 214 are formed in a similar manner. The fifth to the eighth columns are formed from the blocks of quadrants a¹ _(q2), a² _(q2), a³ _(q2) and a⁴ _(q2), the ninth to the twelfth columns are formed from the blocks of quadrants a¹ _(q3), a² _(q3), a³ _(q3) and a⁴ _(q3), and the thirteenth to the sixteenth columns are formed from the blocks of quadrants a¹ _(q4), a² _(q4), a³ _(q4) and a⁴ _(q4).

Converted output data matrix 216 is a 4×16 matrix, and includes flattened versions of output data matrices 206 ¹, 206 ², 206 ³ and 206 ⁴, i.e., converted output data matrices 216 ¹, 216 ², 216 ³ and 216 ⁴. Converted output data matrix 216 may also be arranged into four quadrants o_(q1), o_(q2), o_(q3) and o_(q4), which include the same output elements as the four quadrants o_(q1), o_(q2), o_(q3) and o_(q4) of output feature maps 206.

The calculation of the output elements in the first row of quadrant o_(q1) of converted output data matrix 216 follows.

Output element o¹ ₁ is the dot product of the first row of converted weight matrix 212, i.e., converted weight set 212 ¹, and the first column of converted input data matrix 214. More particularly, output element o¹ ₁ is equal to w¹ ₁·a¹ ₁+w¹ ₂·a¹ ₂+w¹ ₃·a¹ ₆+w¹ ₄·a¹ ₇+w¹ ₅·a² ₁+w¹ ₆·a² ₂+w¹ ₇·a² ₆+w¹ ₈·a² ₇+w¹ ₉·a³ ₁+w¹ ₁₀·a³ ₂+w¹ ₁₁·a³ ₆+w¹ ₁₂·a³ ₇+w¹ ₁₃·a⁴ ₁+w¹ ₁₄·a⁴ ₂+w¹ ₁₅·a⁴ ₆+w¹ ₁₆·a⁴ ₇. As shown above, output element o¹ ₁ of converted output data matrix 216 is equal to output element o¹ ₁ of output feature maps 206.

Output element o¹ ₂ is the dot product of the first row of converted weight matrix 212, i.e., converted weight set 212 ¹, and the second column of converted input data matrix 214. More particularly, output element o¹ ₂ is equal to w¹ ₁·a¹ ₂+w¹ ₂·a¹ ₃+w¹ ₃·a¹ ₇+w¹ ₄·a¹ ₈+w¹ ₅·a² ₂+w¹ ₆·a² ₃+w¹ ₇·a² ₇+w¹ ₈·a² ₈+w¹ ₉·a³ ₂+w¹ ₁₀·a³ ₃+w¹ ₁₁·a³ ₇+w¹ ₁₂·a³ ₈+w¹ ₁₃·a⁴ ₂+w¹ ₁₄·a⁴ ₃+w¹ ₁₅·a⁴ ₇+w¹ ₁₆·a⁴ ₈. As shown above, output element o¹ ₂ of converted output data matrix 216 is equal to output element o¹ ₂ of output feature maps 206.

Output element o¹ ₃ is the dot product of the first row of converted weight matrix 212, i.e., converted weight set 212 ¹, and the third column of converted input data matrix 214. More particularly, output element o¹ ₃ is equal to w¹ ₁·a¹ ₃+w¹ ₂·a¹ ₄+w¹ ₃·a¹ ₈+w¹ ₄·a¹ ₉+w¹ ₅·a² ₃+w¹ ₆·a² ₄+w¹ ₇·a² ₈+w¹ ₈·a² ₉+w¹ ₉·a³ ₃+w¹ ₁₀·a³ ₄+w¹ ₁₁·a³ ₈+w¹ ₁₂·a³ ₉+w¹ ₁₃·a⁴ ₃+w¹ ₁₄·a⁴ ₄+w¹ ₁₅·a⁴ ₈+w¹ ₁₆·a⁴ ₉. As shown above, output element o¹ ₃ of converted output data matrix 216 is equal to output element o¹ ₃ of output feature maps 206.

Output element o¹ ₄ is the dot product of the first row of converted weight matrix 212, i.e., converted weight set 212 ¹, and the fourth column of converted input data matrix 214. More particularly, output element o¹ ₄ is equal to w¹ ₁·a¹ ₄+w¹ ₂·a¹ ₅+w¹ ₃·a¹ ₉+w¹ ₄·a¹ ₁₀+w¹ ₅·a² ₄+w¹ ₆·a² ₅+w¹ ₇·a² ₉+w¹ ₈·a² ₁₀+w¹ ₉·a³ ₄+w¹ ₁₀·a³ ₅+w¹ ₁₁·a³ ₉+w¹ ₁₂·a³ ₁₀+w¹ ₁₃·a⁴ ₄+w¹ ₁₄·a⁴ ₅+w¹ ₁₅·a⁴ ₉+w¹ ₁₆·a⁴ ₁₀. As shown above, output element o¹ ₄ of converted output data matrix 216 is equal to output element o¹ ₄ of output feature maps 206.

For the second row of quadrant o_(q1), output element o² ₁ is the dot product of the second row of converted weight matrix 212, i.e., converted weight set 212 ², and the first column of converted input data matrix 214, output element o² ₂ is the dot product of the second row of converted weight matrix 212, i.e., converted weight set 212 ², and the second column of converted input data matrix 214, output element o² ₃ is the dot product of the second row of converted weight matrix 212, i.e., converted weight set 212 ², and the third column of converted input data matrix 214, and output element o² ₄ is the dot product of the second row of converted weight matrix 212, i.e., converted weight set 212 ², and the fourth column of converted input data matrix 214.

For the third row of quadrant o_(q1), output element o³ ₁ is the dot product of the third row of converted weight matrix 212, i.e., converted weight set 212 ³, and the first column of converted input data matrix 214, output element o³ ₂ is the dot product of the third row of converted weight matrix 212, i.e., converted weight set 212 ³, and the second column of converted input data matrix 214, output element o³ ₃ is the dot product of the third row of converted weight matrix 212, i.e., converted weight set 212 ³, and the third column of converted input data matrix 214, and output element o³ ₄ is the dot product of the third row of converted weight matrix 212, i.e., converted weight set 212 ³, and the fourth column of converted input data matrix 214.

For the fourth row of quadrant o_(q1), output element o⁴ ₁ is the dot product of the fourth row of converted weight matrix 212, i.e., converted weight set 212 ⁴, and the first column of converted input data matrix 214, output element o⁴ ₂ is the dot product of the fourth row of converted weight matrix 212, i.e., converted weight set 212 ⁴, and the second column of converted input data matrix 214, output element o⁴ ₃ is the dot product of the fourth row of converted weight matrix 212, i.e., converted weight set 212 ⁴, and the third column of converted input data matrix 214, and output element o⁴ ₄ is the dot product of the fourth row of converted weight matrix 212, i.e., converted weight set 212 ⁴, and the fourth column of converted input data matrix 214.

The elements of the quadrants o_(q2), o_(q3) and o_(q4) are calculated in a similar manner.

FIG. 4 depicts a data flow diagram 220 for MAC array 218.

As noted above, GEMM operations may be implemented in a dedicated ANN hardware accelerator using an array of MAC units. In this embodiment, MAC array 218 is a systolic, output stationary array that implements converted convolution operation 210 using a 4×4 array of MAC units m₁, . . . , m₁₆. The orientation of transposed converted weight matrix 222, transposed converted input data matrix 224, and transposed converted output data matrix 226 relative to MAC array 218 simplifies illustration; other orientations are also contemplated.

As discussed above, each MAC unit calculates a dot product, between a row of converted weight matrix 212 and a column of converted input data matrix 214, to generate an element of converted output data matrix 216. Generally, a MAC unit includes, inter alia, a multiplier, an adder and a storage register. Each MAC unit is reset by clearing or zeroing its storage register prior to, or at the start of, a new dot product calculation.

Generally, the rows from converted weight matrix 212 are read from local memory, enter MAC array 218 at the first row of MAC units m₁, m₂, m₃ and m₄, and propagate one MAC unit down at the beginning of each processing cycle. Similarly, the columns from converted input data matrix 214 are read from local memory, enter MAC array 218 at the first column of MAC units m₁, m₅, m₉ and m₁₃, and propagate one MAC unit to the right at the beginning of each processing cycle.
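
The cycle-by-cycle behavior described above can be modeled with a small output-stationary simulation (a simplified sketch: each MAC unit at grid position (i, j) accumulates one output element, and the delay-register skewing is folded into its start cycle i+j; the sketch covers only the first four columns of converted input data matrix 214, i.e., quadrant o_(q1)):

    import numpy as np

    rng = np.random.default_rng(4)
    W = rng.random((4, 16))     # converted weight matrix (one row per converted weight set)
    A = rng.random((16, 4))     # first four columns of the converted input data matrix

    K = W.shape[1]
    acc = np.zeros((4, 4))      # output-stationary accumulators, one per MAC unit

    for t in range(K + 6):                  # enough processing cycles to drain the array
        for i in range(4):                  # physical row i works on input data column i
            for j in range(4):              # physical column j works on weight row j
                k = t - i - j               # element index reaching MAC unit (i, j) at cycle t
                if 0 <= k < K:
                    acc[i, j] += W[j, k] * A[k, i]

    assert np.allclose(acc, (W @ A).T)      # o1_1 .. o4_4 of converted output data matrix 216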

The dot product calculations performed by MAC unit m₁ for the blocks of the first quadrants a¹ _(q1), a² _(q1), a³ _(q1) and a⁴ _(q1) of converted input data matrix 214 are discussed in detail below, while the dot product calculations performed by the remaining MAC units of MAC array 218 are summarized below.

MAC unit m₁ calculates the dot product of the first row of converted weight matrix 212 (i.e., converted weight set 212 ¹) and the first column of converted input data matrix 214 to generate element o¹ ₁ of converted output data matrix 216. During processing cycle 1, MAC unit m₁ receives a₁ and w¹ ₁ from local memory, multiplies a₁ and w¹ ₁ to generate an intermediate product, adds the intermediate product to the value stored in the storage register (i.e., 0), and stores the accumulated result back in the storage register. During processing cycle 2, MAC unit m₁ transmits a₁ to MAC unit m₂ and w¹ ₁ to MAC unit m₅, receives a₂ and w¹ ₂ from local memory, multiplies a₂ and w¹ ₂ to generate an intermediate product, adds the intermediate product to the value stored in the storage register, and stores the accumulated result back in the storage register.

During processing cycle 3, MAC unit m₁ transmits a₂ to MAC unit m₂ and w¹ ₂ to MAC unit m₅, receives a₆ and w¹ ₃ from local memory, multiplies a₆ and w¹ ₃ to generate an intermediate product, adds the intermediate product to the value stored in the storage register, and stores the accumulated result back in the storage register. During processing cycle 4, MAC unit m₁ transmits a₆ to MAC unit m₂ and w¹ ₃ to MAC unit m₅, receives a₇ and w¹ ₄ from the local memory, multiplies a₇ and w¹ ₄ to generate an intermediate product, adds the intermediate product to the value stored in the storage register, and stores the accumulated result back in the storage register.

Processing cycles 5 through 16 multiply and accumulate the remaining 12 elements of the first row of converted weight matrix 212 and the first column of converted input data matrix 214. At the end of processing cycle 16, MAC unit m₁ outputs element o¹ ₁.

The remainder of the first row of MAC array 218 includes MAC units m₂, m₃ and m₄.

After an initial delay of one processing cycle, MAC unit m₂ receives weights from the first delay register ff₁ and input data from MAC unit m₁, transmits weights to MAC unit m₆ and input data to MAC unit m₃, and calculates the dot product of the second row of converted weight matrix 212 (i.e., converted weight set 212 ²) and the first column of converted input data matrix 214 to generate element o² ₁ of converted output data matrix 216. The initial delay of one processing cycle allows the delay pipeline (i.e., delay register ff₁) to be filled with weights transferred from memory, and the input data to become available from MAC unit m₁. At the end of processing cycle 17, MAC unit m₂ outputs element o² ₁.

After an initial delay of two processing cycles, MAC unit m₃ receives weights from the second delay register ff₂ and input data from MAC unit m₂, transmits weights to MAC unit m₇ and input data to MAC unit m₄, and calculates the dot product of the third row of converted weight matrix 212 (i.e., converted weight set 212 ³) and the first column of converted input data matrix 214 to generate element o³ ₁ of converted output data matrix 216. The initial delay of two processing cycles allows the delay pipeline (i.e., delay registers ff₁ and ff₂) to be filled with weights transferred from memory, and the input data to become available from MAC unit m₂. At the end of processing cycle 18, MAC unit m₃ outputs element o³ ₁.

After an initial delay of three processing cycles, MAC unit m₄ receives weights from the third delay register ff₃ and input data from MAC unit m₃, transmits weights to MAC unit m₈, and calculates the dot product of the fourth row of converted weight matrix 212 (i.e., converted weight set 212 ⁴) and the first column of converted input data matrix 214 to generate element o⁴ ₁ of converted output data matrix 216. The initial delay of three processing cycles allows the delay pipeline (i.e., delay registers ff₁, ff₂ and ff₃) to be filled with weights transferred from memory, and the input data to become available from MAC unit m₃. At the end of processing cycle 19, MAC unit m₄ outputs element o⁴ ₁.

The second row of MAC array 218 includes MAC units m₅, m₆, m₇ and m₈.

After an initial delay of one processing cycle, MAC unit m₅ receives weights from MAC unit m₁ and input data from a first delay register ff₁, transmits weights to MAC unit m₉ and input data to MAC unit m₆, and calculates the dot product of the first row of converted weight matrix 212 (i.e., converted weight set 212 ¹) and the second column of converted input data matrix 214 to generate element o¹ ₂ of converted output data matrix 216. The initial delay of one processing cycle allows the delay pipeline (i.e., delay register ff₁) to be filled with input data transferred from memory, and the weights to become available from MAC unit m₁. At the end of processing cycle 17, MAC unit m₅ outputs element o¹ ₂.

After an initial delay of two processing cycles, MAC unit m₆ receives weights from MAC unit m₂ and input data from MAC unit m₅, transmits weights to MAC unit m₁₀ and input data to MAC unit m₇, and calculates the dot product of the second row of converted weight matrix 212 (i.e., converted weight set 212 ²) and the second column of converted input data matrix 214 to generate element o² ₂ of converted output data matrix 216. The initial delay of two processing cycles allows the weights to become available from MAC unit m₂, and the input data to become available from MAC unit m₅. At the end of processing cycle 18, MAC unit m₆ outputs element o² ₂.

After an initial delay of three processing cycles, MAC unit m₇ receives weights from MAC unit m₃ and input data from MAC unit m₆, transmits weights to MAC unit m₁₁ and input data to MAC unit m₈, and calculates the dot product of the third row of converted weight matrix 212 (i.e., converted weight set 212 ³) and the second column of converted input data matrix 214 to generate element o³ ₂ of converted output data matrix 216. The initial delay of three processing cycles allows the weights to become available from MAC unit m₃, and the input data to become available from MAC unit m₆. At the end of processing cycle 19, MAC unit m₇ outputs element o³ ₂.

After an initial delay of four processing cycles, MAC unit m₈ receives weights from MAC unit m₄ and input data from MAC unit m₇, transmits weights to MAC unit m₁₂, and calculates the dot product of the fourth row of converted weight matrix 212 (i.e., converted weight set 212 ⁴) and the second column of converted input data matrix 214 to generate element o⁴ ₂ of converted output data matrix 216. The initial delay of four processing cycles allows the weights to become available from MAC unit m₄, and the input data to become available from MAC unit m₇. At the end of processing cycle 20, MAC unit m₈ outputs element o⁴ ₂.

The third row of MAC array 218 includes MAC units m₉, m₁₀, m₁₁ and m₁₂.

After an initial delay of two processing cycles, MAC unit m₉ receives weights from MAC unit m₅ and input data from a second delay register ff₂, transmits weights to MAC unit m₁₃ and input data to MAC unit m₁₀, and calculates the dot product of the first row of converted weight matrix 212 (i.e., converted weight set 212 ¹) and the third column of converted input data matrix 214 to generate element o¹ ₃ of converted output data matrix 216. The initial delay of two processing cycles allows the delay pipeline (i.e., delay registers ff₁ and ff₂) to be filled with input data transferred from memory, and the weights to become available from MAC unit m₅. At the end of processing cycle 18, MAC unit m₉ outputs element o¹ ₃.

After an initial delay of three processing cycles, MAC unit m₁₀ receivesweights from MAC unit m₆ and input data from MAC unit m₉, transmitsweights to MAC unit m₁₄ and input data to MAC unit m₁₁, and calculatesthe dot product of the second row of converted weight matrix 212 (i.e.,converted weight set 212 ²) and the third column of converted input datamatrix 214 to generate element o² ₃ of converted output data matrix 216.The initial delay of three processing cycles allows the weights tobecome available from MAC unit m₆, and the input data to becomeavailable from MAC unit m₉. At the end of processing cycle 19, MAC unitm₁₀ outputs element o² ₃.

After an initial delay of four processing cycles, MAC unit m₁₁ receives weights from MAC unit m₇ and input data from MAC unit m₁₀, transmits weights to MAC unit m₁₅ and input data to MAC unit m₁₂, and calculates the dot product of the third row of converted weight matrix 212 (i.e., converted weight set 212 ³) and the third column of converted input data matrix 214 to generate element o³ ₃ of converted output data matrix 216. The initial delay of four processing cycles allows the weights to become available from MAC unit m₇, and the input data to become available from MAC unit m₁₀. At the end of processing cycle 20, MAC unit m₁₁ outputs element o³ ₃.

After an initial delay of five processing cycles, MAC unit m₁₂ receivesweights from MAC unit m₈ and input data from MAC unit m₁₁, transmitsweights to MAC unit m₁₆, and calculates the dot product of the fourthrow of converted weight matrix 212 (i.e., converted weight set 212 ⁴)and the third column of converted input data matrix 214 to generateelement o⁴ ₃ of converted output data matrix 216. The initial delay offive processing cycles allows the weights to become available from MACunit m₈, and the input data to become available from MAC unit m₁₁. Atthe end of processing cycle 21, MAC unit m₁₂ outputs element o⁴ ₃.

The fourth row of MAC array 218 includes MAC units m₁₃, m₁₄, m₁₅ andm₁₆.

After an initial delay of three processing cycles, MAC unit m₁₃ receivesweights from MAC unit m₉ and input data from a third delay register ff₃,transmits input data to MAC unit m₁₄, and calculates the dot product ofthe first row of converted weight matrix 212 (i.e., converted weight set212 ¹) and the fourth column of converted input data matrix 214 togenerate element o¹ ₄ of converted output data matrix 216. The initialdelay of three processing cycles allows the delay pipeline (i.e., delayregisters ff₁, ff₂ and ff₃) to be filled with input data transferredfrom memory, and the weights to become available from MAC unit m₉. Atthe end of processing cycle 19, MAC unit m₁₃ outputs element o¹ ₄.

After an initial delay of four processing cycles, MAC unit m₁₄ receivesweights from MAC unit m₁₀ and input data from MAC unit m₁₃, transmitsinput data to MAC unit m₁₅, and calculates the dot product of the secondrow of converted weight matrix 212 (i.e., converted weight set 212 ²)and the fourth column of converted input data matrix 214 to generateelement o² ₄ of converted output data matrix 216. The initial delay offour processing cycles allows the weights to become available from MACunit m₁₀, and the input data to become available from MAC unit m₁₃. Atthe end of processing cycle 20, MAC unit m₁₄ outputs element o² ₄.

After an initial delay of five processing cycles, MAC unit m₁₅ receivesweights from MAC unit m₁₁ and input data from MAC unit m₁₄, transmitsinput data to MAC unit m₁₆, and calculates the dot product of the thirdrow of converted weight matrix 212 (i.e., converted weight set 212 ³)and the fourth column of converted input data matrix 214 to generateelement o³ ₄ of converted output data matrix 216. The initial delay offive processing cycles allows the weights to become available from MACunit m₁₁, and the input data to become available from MAC unit m₁₄. Atthe end of processing cycle 21, MAC unit m₁₅ outputs element o³ ₄.

After an initial delay of six processing cycles, MAC unit m₁₆ receivesweights from MAC unit m₁₂ and input data from MAC unit m₁₅, andcalculates the dot product of the fourth row of converted weight matrix212 (i.e., converted weight set 212 ⁴) and the fourth column ofconverted input data matrix 214 to generate element o⁴ ₄ of convertedoutput data matrix 216. The initial delay of six processing cyclesallows the weights to become available from MAC unit m₁₂, and the inputdata to become available from MAC unit m₁₅. At the end of processingcycle 22, MAC unit m₁₆ outputs element o⁴ ₄.
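The timing pattern described above can be summarized compactly. The following is a minimal sketch, not part of the disclosure, assuming each dot product occupies 16 processing cycles once a MAC unit's pipeline has filled (consistent with the output cycles listed above) and that the initial delay of a unit equals the sum of its zero-based row and column positions in MAC array 218.

```python
# Minimal sketch: initial delay and output cycle for each MAC unit of the 4x4
# systolic array 218, assuming a 16-cycle dot product and delay = row + col.
def mac_array_timing(rows=4, cols=4, dot_product_cycles=16):
    timing = {}
    for r in range(rows):            # r = 0 for m1..m4, r = 1 for m5..m8, ...
        for c in range(cols):        # c = 0 for the first unit in each row
            unit = r * cols + c + 1
            initial_delay = r + c                       # cycles to fill delay registers / pipeline
            output_cycle = dot_product_cycles + initial_delay
            timing[f"m{unit}"] = (initial_delay, output_cycle)
    return timing

for name, (delay, out) in mac_array_timing().items():
    print(f"{name}: initial delay {delay}, outputs at end of cycle {out}")
# e.g. m4 -> delay 3, cycle 19; m6 -> delay 2, cycle 18; m16 -> delay 6, cycle 22.
```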

After the blocks of the first quadrants a¹ _(q1), a² _(q1), a³ _(q1) anda⁴ _(q1) of converted input data matrix 214 have been processed, thenext sequence of operations processes the blocks of the second quadrantsa¹ _(q2), a² _(q2), a³ _(q2) and a⁴ _(q2). After the blocks of thesecond quadrants a¹ _(q2), a² _(q2), a³ _(q2) and a⁴ _(q2) have beenprocessed, the next sequence of operations processes the blocks of thethird quadrants a¹ _(q3), a² _(q3), a³ _(q3) and a⁴ _(q3). And, afterthe blocks of the third quadrants a¹ _(q3), a² _(q3), a³ _(q3) and a⁴_(q3) have been processed, the final sequence of operations processesthe blocks of the fourth quadrants a¹ _(q4), a² _(q4), a³ _(q4) and a⁴_(q4). Converted weight matrix 212 is accessed for each sequence ofoperations.

Unfortunately, for CNNs executing on CPUs or other coprocessors, GEMMoperations consume a significant number of processor cycles due to thelarge number of multiplications that are required. For example, oneknown image recognition CNN requires 3 giga operations per second (GOPS)per input data frame. Compounding this problem, many of the matricesupon which the GEMM operations are performed are sparse, which producesa very inefficient use of processing resources. Conversely, if GEMMoperations could significantly reduce “multiply by zero” conditions,processing and power requirements could be significantly reduced.

Certain approaches that attempt to reduce “multiply by zero” conditions complicate the GEMM operations and introduce significant processing overhead on the CPU. Additionally, other approaches fail when applied to activations due to the dynamic variation of the position of the zero values in the input feature maps during the inference. Because of this difficulty, ANN accelerators maintain the input feature maps in a dense or uncompressed form, even though the input feature maps typically contain a significant amount of activations that have a value of zero (e.g., greater than 60%).

Embodiments of the present disclosure advantageously compress the input feature maps during inference by dynamically removing or pruning smaller value activations (including zero values), lower the memory footprint of the activation memory by a factor greater than 2×, reduce the power consumption related to activation memory accesses (e.g., DRAM, SRAM, cache, etc.), and increase the effective processing throughput by a factor greater than 2× for a MAC array.

FIG. 5A depicts input feature maps 204 from FIG. 3A, in accordance with an embodiment of the present disclosure. As discussed above, input feature maps 204 include one 5×5 input data matrix for each channel, i.e., input data matrices 204 ¹, 204 ², 204 ³ and 204 ⁴.

FIG. 5B depicts basic blocks 304, in accordance with an embodiment of the present disclosure.

Generally, input feature maps 204 may be decomposed into 25 basic blocks 304 along the depth or channel dimension. Each basic block is a 1×1×b tensor, and includes b activation values. The depth, b, is a hyperparameter having a value of 4, 8, 16, etc. In this embodiment, b equals 4, i.e., the number of channels.
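The decomposition can be pictured as gathering, for each spatial position of the input feature maps, the b activations stacked along the channel dimension. The following is a minimal sketch under that reading; the shapes and function name are illustrative, not taken from the disclosure.

```python
import numpy as np

# Minimal sketch: split a set of input feature maps with b channels and HxW
# spatial positions into H*W basic blocks of b activations each.
def to_basic_blocks(ifm: np.ndarray) -> np.ndarray:
    b, h, w = ifm.shape                        # e.g. (4, 5, 5) for this embodiment
    # Move channels last, then flatten the spatial positions: one row per block.
    return ifm.transpose(1, 2, 0).reshape(h * w, b)

ifm = np.arange(4 * 5 * 5, dtype=np.int32).reshape(4, 5, 5)
blocks = to_basic_blocks(ifm)
print(blocks.shape)    # (25, 4): 25 basic blocks of b = 4 activations each
```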

Basic block 304 ₁ includes activations a¹ ₁, a² ₁, a³ ₁ and a⁴ ₁, basicblock 304 ₂ includes activations a¹ ₂, a² ₂, a³ ₂ and a⁴ ₂, basic block304 ₃ includes activations a¹ ₃, a² ₃, a³ ₃ and a⁴ ₃, basic block 304 ₄includes activations a¹ ₄, a² ₄, a³ ₄ and a⁴ ₄, and basic block 304 ₅includes activations a¹ ₅, a² ₅, a³ ₅ and a⁴ ₅.

Basic blocks 304 ₆, 304 ₇, 304 ₈, 304 ₉, 304 ₁₀, 304 ₁₁, 304 ₁₂, 304 ₁₃,304 ₁₄, 304 ₁₅, 304 ₁₆, 304 ₁₇, 304 ₁₈, 304 ₁₉ and 304 ₂₀, indicated bythe ellipses, are not depicted for clarity. Basic block 304 ₆ includesactivations a¹ ₆, a² ₆, a³ ₆ and a⁴ ₆, basic block 304 ₇ includesactivations a¹ ₇, a² ₇, a³ ₇ and a⁴ ₇, basic block 304 ₈ includesactivations a¹ ₈, a² ₈, a³ ₈ and a⁴ ₈, basic block 304 ₉ includesactivations a¹ ₉, a² ₉, a³ ₉ and a⁴ ₉, basic block 304 ₁₀ includesactivations a¹ ₁₀, a² ₁₀, a³ ₁₀ and a⁴ ₁₀, basic block 304 ₁₁ includesactivations a¹ ₁₁, a² ₁₁, a³ ₁₁ and a⁴ ₁₁, basic block 304 ₁₂ includesactivations a¹ ₁₂, a² ₁₂, a³ ₁₂ and a⁴ ₁₂, basic block 304 ₁₃ includesactivations a¹ ₁₃, a² ₁₃, a³ ₁₃ and a⁴ ₁₃, basic block 304 ₁₄ includesactivations a¹ ₁₄, a² ₁₄, a³ ₁₄ and a⁴ ₁₄, basic block 304 ₁₅ includesactivations a¹ ₁₅, a² ₁₅, a³ ₁₅ and a⁴ ₁₅, basic block 304 ₁₆ includesactivations a¹ ₁₆, a² ₁₆, a³ ₁₆ and a⁴ ₁₆, basic block 304 ₁₇ includesactivations a¹ ₁₇, a² ₁₇, a³ ₁₇ and a⁴ ₁₇, basic block 304 ₁₈ includesactivations a¹ ₁₈, a² ₁₈, a³ ₁₈ and a⁴ ₁₈, basic block 304 ₁₉ includesactivations a¹ ₁₉, a² ₁₉, a³ ₁₉ and a⁴ ₁₉, and basic block 304 ₂₀includes activations a¹ ₂₀, a² ₂₀, a³ ₂₀ and a⁴ ₂₀.

Basic block 304 ₂₁ includes activations a¹ ₂₁, a² ₂₁, a³ ₂₁ and a⁴ ₂₁,basic block 304 ₂₂ includes activations a¹ ₂₂, a² ₂₂, a³ ₂₂ and a⁴ ₂₂,basic block 304 ₂₃ includes activations a¹ ₂₃, a² ₂₃, a³ ₂₃ and a⁴ ₂₃,basic block 304 ₂₄ includes activations a¹ ₂₄, a² ₂₄, a³ ₂₄ and a⁴ ₂₄,and basic block 304 ₂₅ includes activations a¹ ₂₅, a² ₂₅, a³ ₂₅ and a⁴₂₅.

FIG. 5C depicts basic block matrices 314, in accordance with an embodiment of the present disclosure.

Basic blocks 304 may be reformed as 25 respective basic block matrices 314, i.e., basic block matrices 314 ₁, 314 ₂, 314 ₃, 314 ₄, 314 ₅, 314 ₆, 314 ₇, 314 ₈, 314 ₉, 314 ₁₀, 314 ₁₁, 314 ₁₂, 314 ₁₃, 314 ₁₄, 314 ₁₅, 314 ₁₆, 314 ₁₇, 314 ₁₈, 314 ₁₉, 314 ₂₀, 314 ₂₁, 314 ₂₂, 314 ₂₃, 314 ₂₄, and 314 ₂₅. Each basic block matrix 314 has 4 rows and a single column (4×1), and the same activation elements as the respective basic block 304. For example, basic block 304 ₁ and basic block matrix 314 ₁ both include activations a¹ ₁, a² ₁, a³ ₁ and a⁴ ₁.

Because input feature maps 204 have been reformed into basic block matrices 314, dynamic activation pruning (DAP) may be easily effected across the channel dimension rather than the height and width dimensions, which advantageously avoids adverse effects on the local data within any particular input data matrix. A sparsity value, p, is initially selected, such as, for example, 25%, 50%, 75%, etc. In this embodiment, p is 0.5 (or 50%), and half of the values in basic block matrices 314 will have a value of 0 after pruning. Advantageously, the positions of the zeros within each basic block matrix 314 will be random due to the nature of DAP.

FIG. 6A depicts a data flow diagram 300 for a portion of a training process for CNN 15, according to an embodiment of the present disclosure.

In order to compress the input feature maps during inference, CNN 15 is first trained using a DAP process in the forward path of the convolutional layer calculations. For example, data flow diagram 300 depicts a portion of the training process for CNN 15 that includes converted convolutional layer calculation 210. During the forward phase, filter 202 is provided to converted convolutional layer calculation 210. Input feature maps are first provided to DAP process 302, which creates, prunes and compresses basic block matrices 314, and then provides compressed basic block matrices 324 to converted convolutional layer calculation 210, as described below. Converted convolutional layer calculation 210 then generates output feature maps 206, which are provided to the next layer of CNN 15. During the backward phase, the gradients are backpropagated and weight sets 202 ¹, 202 ², 202 ³ and 202 ⁴ within filter 202 are updated.

In one embodiment, DAP process 302 creates basic block matrices 314 based on the input feature maps, selects a number (“k”) of the largest activation values for each basic block matrix 314 based on the magnitude of the activation values and the sparsity value, prunes the remaining activation values in each basic block matrix 314 by setting the values to 0, and generates each compressed basic block matrix 324 by discarding the zero-valued activations from the basic block matrix 314. In other embodiments, each compressed basic block matrix 324 may be formed directly from the identified activation values within the corresponding basic block matrix 314. A mask is generated for each basic block matrix 314 to identify the selected activation values, as described below.
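A minimal software sketch of this per-block step follows, assuming “largest” means largest absolute value and that k = b·(1−p); the function and variable names are illustrative rather than drawn from the disclosure.

```python
import numpy as np

# Minimal sketch of DAP for one basic block: keep the k largest-magnitude
# activations, zero the rest, and emit (compressed values, bit mask).
def dap_prune_block(block: np.ndarray, sparsity: float):
    b = block.size
    k = int(round(b * (1.0 - sparsity)))          # number of activations to keep
    keep = np.argsort(-np.abs(block))[:k]         # indices of the k largest magnitudes
    mask = np.zeros(b, dtype=np.uint8)
    mask[keep] = 1                                # 1 = kept, 0 = pruned
    compressed = block[np.sort(keep)]             # kept values, in original order
    return compressed, mask

block = np.array([0.9, -0.1, 0.5, 0.05], dtype=np.float32)
vals, mask = dap_prune_block(block, sparsity=0.5)
print(vals, mask)    # [0.9 0.5] [1 0 1 0]  -> mask "1010", as in the example below
```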

During training, DAP process 302 may be implemented by the processor that is hosting the training process for CNN 15, such as, for example, a graphics processing unit (GPU), neural processing unit (NPU), central processing unit (CPU), etc. During inference, DAP process 302 must be efficiently executed, which includes identifying or selecting the largest “k” activation values for each basic block matrix 314. In many embodiments, a DAP selection circuit may be implemented in hardware, such as, for example, for low power mobile device applications, while in other embodiments, DAP selection may be implemented in software.

FIG. 6B depicts a block diagram of DAP selection circuit 306, in accordance with an embodiment of the present disclosure.

DAP selection circuit 306 may select activation values for basic block matrices that have eight activation elements. Other configurations of basic block matrices are also contemplated. DAP selection circuit 306 includes input registers 330, i.e., x₁, x₂, x₃, x₄, x₅, x₆, x₇ and x₈, magnitude compactors 341, 342, 343, 344, 345, 346 and 347, and 8-bit output register 350 that includes output bits b₁, b₂, b₃, b₄, b₅, b₆, b₇ and b₈. Each input register 330 stores one activation value, and includes an enable bit 332 to allow the stored activation value to be included or excluded from the selection process. Each output bit is associated with a respective input register 330, i.e., output bit b₁ is associated with input register x₁, and so on.

Magnitude compactor 341 has two inputs coupled to input registers x₁ and x₂, and an output. Magnitude compactor 342 has two inputs coupled to input registers x₃ and x₄, and an output. Magnitude compactor 343 has two inputs coupled to input registers x₅ and x₆, and an output. Magnitude compactor 344 has two inputs coupled to input registers x₇ and x₈, and an output. Magnitude compactor 345 has two inputs coupled to the outputs of magnitude compactors 341 and 342, and an output. Magnitude compactor 346 has two inputs coupled to the outputs of magnitude compactors 343 and 344, and an output. Magnitude compactor 347 has two inputs coupled to the outputs of magnitude compactors 345 and 346.

Each magnitude compactor 341, 342, 343, 344, 345 and 346 receives two activation values, determines which activation value has the greater magnitude, and then passes the activation value with the greater magnitude to the next magnitude compactor in the chain. Magnitude compactor 347 receives the final two activation values, determines which activation value has the greater magnitude, and sets the respective output bit within output register 350.

If the basic block matrices include 8 activation values and the sparsity value is 0.5 (50%), DAP selection circuit 306 may be executed four times to select the four largest activation values. Alternatively, 4 DAP selection circuits 306 may be executed simultaneously to select the four largest activation values.

FIG. 6C depicts a block diagram of DAP selection circuit 308, in accordance with an embodiment of the present disclosure.

DAP selection circuit 308 may select activation values for basic block matrices that have four activation elements, e.g., basic block matrices 314. DAP selection circuit 308 includes input registers 330, i.e., x₁, x₂, x₃ and x₄, magnitude compactors 341, 342 and 347, and 4-bit output register 350 that includes output bits b₁, b₂, b₃ and b₄. Each input register 330 stores one activation value, and includes an enable bit 332 to allow the stored activation value to be included or excluded from the selection process. Each output bit is associated with a respective input register 330, i.e., output bit b₁ is associated with input register x₁, and so on.

Magnitude compactor 341 has two inputs coupled to input registers x₁ and x₂, and an output. Magnitude compactor 342 has two inputs coupled to input registers x₃ and x₄, and an output. Magnitude compactor 347 has two inputs coupled to the outputs of magnitude compactors 341 and 342. Each magnitude compactor 341 and 342 receives two activation values, determines which activation value has the greater magnitude, and then passes the activation value with the greater magnitude to magnitude compactor 347. Magnitude compactor 347 receives the final two activation values, determines which activation value has the greater magnitude, and sets the respective output bit within output register 350.

In the embodiment discussed above, basic block matrices 314 each have 4 activation values, b equals 4, k equals 2, and, in one example, p equals 0.5 (50%). In order to determine the 2 largest activation values within a particular basic block matrix 314, DAP selection circuit 308 is executed twice, and output register 350 stores the 4-bit mask for that particular basic block matrix 314. Two of the bits will be set to 1, indicating that the respective activation values have been selected, and two of the bits will be set to 0, indicating that the respective activation values may be discarded during the compression process.
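The repeated execution of the selection circuit can be emulated in software as below. This is a minimal behavioral sketch of the "execute k times with enable bits" usage described above, not RTL, and the function name is illustrative.

```python
# Minimal behavioral sketch of DAP selection circuit 308: each pass compares
# enabled inputs by magnitude, marks the overall winner in the output mask,
# and disables that input register for the next pass.
def dap_select(values, k):
    enable = [1] * len(values)          # enable bit 332 for each input register
    mask = [0] * len(values)            # output register 350
    for _ in range(k):
        winner = max(
            (i for i in range(len(values)) if enable[i]),
            key=lambda i: abs(values[i]),
        )
        mask[winner] = 1                # set the bit for the selected activation
        enable[winner] = 0              # exclude it from the next execution
    return mask

print(dap_select([0.9, -0.1, 0.5, 0.05], k=2))   # [1, 0, 1, 0] -> mask "1010"
```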

For example, basic block matrix 314 ₁ includes four elements, i.e., activations a¹ ₁, a² ₁, a³ ₁ and a⁴ ₁, and, in one example, a¹ ₁>a³ ₁>a² ₁>a⁴ ₁. During the first execution of DAP selection circuit 308, activation a¹ ₁ is stored in input register x₁ and the enable bit is set to 1, activation a² ₁ is stored in input register x₂ and the enable bit is set to 1, activation a³ ₁ is stored in input register x₃ and the enable bit is set to 1, and activation a⁴ ₁ is stored in input register x₄ and the enable bit is set to 1. Magnitude compactor 341 determines that a¹ ₁>a² ₁, and outputs a¹ ₁ to magnitude compactor 347. Magnitude compactor 342 determines that a³ ₁>a⁴ ₁, and outputs a³ ₁ to magnitude compactor 347. Magnitude compactor 347 determines that a¹ ₁>a³ ₁, and sets bit b₁ in output register 350 to 1, and bits b₂, b₃ and b₄ in output register 350 to 0.

During the second execution of DAP selection circuit 308, the enable bit for input register x₁ is set to 0 to prevent activation a¹ ₁ from being provided to magnitude compactor 341. Magnitude compactor 341 determines that a² ₁>0 (i.e., the value that replaced a¹ ₁), and outputs a² ₁ to magnitude compactor 347. Magnitude compactor 342 determines that a³ ₁>a⁴ ₁, and outputs a³ ₁ to magnitude compactor 347. Magnitude compactor 347 determines that a³ ₁>a² ₁, and sets bit b₃ in output register 350 to 1. Output register 350 now stores the mask for basic block matrix 314 ₁, which is “1010”.

In another embodiment, an index set may be generated for each basic block matrix 314 rather than a mask. The index set includes two index values that indicate which activation values have the greatest magnitude. For example, the index set for basic block matrix 314 ₁ would be {0,2}, which indicates that the first and the third elements of basic block matrix 314 ₁ have the greatest magnitude. The index values may start at 0 or 1.
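The mask and the index set carry the same information; a minimal sketch of the conversion (0-based indices, illustrative function name) is:

```python
# Minimal sketch of the mask / index-set equivalence noted above.
def mask_to_index_set(mask):
    return {i for i, bit in enumerate(mask) if bit == 1}

print(mask_to_index_set([1, 0, 1, 0]))   # {0, 2} for basic block matrix 314_1
```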

FIG. 7A depicts basic block matrices 314 after DAP selection, in accordance with an embodiment of the present disclosure.

After DAP selection, each basic block matrix 314 includes two activation values and two 0 values, as well as the associated mask (or index set).

Basic block matrix 314 ₁ includes activations a¹ ₁ and a³ ₁ and mask“1010”. Basic block matrix 314 ₂ includes activations a² ₂ and a⁴ ₂ andmask “0101”. Basic block matrix 314 ₃ includes activations a¹ ₃ and a⁴ ₃and mask “1001”. Basic block matrix 314 ₄ includes activations a¹ ₄ anda² ₄ and mask “1100”. Basic block matrix 314 ₅ includes activations a² ₅and a³ ₅ and mask “0110”. Basic block matrix 314 ₆ includes activationsa² ₆ and a⁴ ₆ and mask “0101”. Basic block matrix 314 ₇ includesactivations a¹ ₇ and a⁴ ₇ and mask “1001”. Basic block matrix 314 ₈includes activations a¹ ₈ and a² ₈ and mask “1100”. Basic block matrix314 ₉ includes activations a² ₉ and a³ ₉ and mask “0110”. Basic blockmatrix 314 ₁₀ includes activations a¹ ₁₀ and a³ ₁₀ and mask “1010”.

Basic block matrix 314 ₁₁ includes activations a¹ ₁₁ and a⁴ ₁₁ and mask “1001”. Basic block matrix 314 ₁₂ includes activations a¹ ₁₂ and a² ₁₂ and mask “1100”. Basic block matrix 314 ₁₃ includes activations a² ₁₃ and a³ ₁₃ and mask “0110”. Basic block matrix 314 ₁₄ includes activations a¹ ₁₄ and a³ ₁₄ and mask “1010”. Basic block matrix 314 ₁₅ includes activations a² ₁₅ and a⁴ ₁₅ and mask “0101”. Basic block matrix 314 ₁₆ includes activations a¹ ₁₆ and a² ₁₆ and mask “1100”. Basic block matrix 314 ₁₇ includes activations a² ₁₇ and a³ ₁₇ and mask “0110”. Basic block matrix 314 ₁₈ includes activations a¹ ₁₈ and a³ ₁₈ and mask “1010”. Basic block matrix 314 ₁₉ includes activations a² ₁₉ and a⁴ ₁₉ and mask “0101”. Basic block matrix 314 ₂₀ includes activations a¹ ₂₀ and a⁴ ₂₀ and mask “1001”.

Basic block matrix 314 ₂₁ includes activations a² ₂₁ and a³ ₂₁ and mask “0110”. Basic block matrix 314 ₂₂ includes activations a¹ ₂₂ and a³ ₂₂ and mask “1010”. Basic block matrix 314 ₂₃ includes activations a² ₂₃ and a⁴ ₂₃ and mask “0101”. Basic block matrix 314 ₂₄ includes activations a¹ ₂₄ and a⁴ ₂₄ and mask “1001”. Basic block matrix 314 ₂₅ includes activations a¹ ₂₅ and a² ₂₅ and mask “1100”.

FIG. 7B depicts compressed basic block matrices 324, in accordance with an embodiment of the present disclosure.

After DAP selection, compressed basic block matrices 324 are generated, based on the mask or index, by removing the 0 values from basic block matrices 314, which resizes each basic block matrix 314 from 4×1 to 2×1. For example, if the sparsity is 50%, each basic block matrix 314 is compressed into half of its original size plus a small amount of metadata for the mask or index. The compression rate for 8-bit activations is 64/(32+8)=8/5=1.60.
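A quick check of that ratio, computed as raw bits divided by kept activation bits plus mask bits; with 8-bit activations and 50% sparsity the value is 1.6 both for an 8-element block (64/(32+8), as quoted above) and for the 4-element basic block matrices of this embodiment (32/(16+4)). The helper name is illustrative.

```python
# Minimal check of the 1.6x compression rate: raw bits / (kept bits + mask bits).
def compression_rate(b, bits_per_activation=8, sparsity=0.5):
    raw = b * bits_per_activation
    kept = int(b * (1.0 - sparsity)) * bits_per_activation
    mask_bits = b                       # one mask bit per basic-block element
    return raw / (kept + mask_bits)

print(compression_rate(8), compression_rate(4))   # 1.6 1.6
```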

Compressed basic block matrix 324 ₁ includes activations a¹ ₁ and a³ ₁ and mask “1010”. Compressed basic block matrix 324 ₂ includes activations a² ₂ and a⁴ ₂ and mask “0101”. Compressed basic block matrix 324 ₃ includes activations a¹ ₃ and a⁴ ₃ and mask “1001”. Compressed basic block matrix 324 ₄ includes activations a¹ ₄ and a² ₄ and mask “1100”. Compressed basic block matrix 324 ₅ includes activations a² ₅ and a³ ₅ and mask “0110”. Compressed basic block matrix 324 ₆ includes activations a² ₆ and a⁴ ₆ and mask “0101”. Compressed basic block matrix 324 ₇ includes activations a¹ ₇ and a⁴ ₇ and mask “1001”. Compressed basic block matrix 324 ₈ includes activations a¹ ₈ and a² ₈ and mask “1100”. Compressed basic block matrix 324 ₉ includes activations a² ₉ and a³ ₉ and mask “0110”. Compressed basic block matrix 324 ₁₀ includes activations a¹ ₁₀ and a³ ₁₀ and mask “1010”.

Compressed basic block matrix 324 ₁₁ includes activations a¹ ₁₁ and a⁴ ₁₁ and mask “1001”. Compressed basic block matrix 324 ₁₂ includes activations a¹ ₁₂ and a² ₁₂ and mask “1100”. Compressed basic block matrix 324 ₁₃ includes activations a² ₁₃ and a³ ₁₃ and mask “0110”. Compressed basic block matrix 324 ₁₄ includes activations a¹ ₁₄ and a³ ₁₄ and mask “1010”. Compressed basic block matrix 324 ₁₅ includes activations a² ₁₅ and a⁴ ₁₅ and mask “0101”. Compressed basic block matrix 324 ₁₆ includes activations a¹ ₁₆ and a² ₁₆ and mask “1100”. Compressed basic block matrix 324 ₁₇ includes activations a² ₁₇ and a³ ₁₇ and mask “0110”. Compressed basic block matrix 324 ₁₈ includes activations a¹ ₁₈ and a³ ₁₈ and mask “1010”. Compressed basic block matrix 324 ₁₉ includes activations a² ₁₉ and a⁴ ₁₉ and mask “0101”. Compressed basic block matrix 324 ₂₀ includes activations a¹ ₂₀ and a⁴ ₂₀ and mask “1001”.

Compressed basic block matrix 324 ₂₁ includes activations a² ₂₁ and a³ ₂₁ and mask “0110”. Compressed basic block matrix 324 ₂₂ includes activations a¹ ₂₂ and a³ ₂₂ and mask “1010”. Compressed basic block matrix 324 ₂₃ includes activations a² ₂₃ and a⁴ ₂₃ and mask “0101”. Compressed basic block matrix 324 ₂₄ includes activations a¹ ₂₄ and a⁴ ₂₄ and mask “1001”. Compressed basic block matrix 324 ₂₅ includes activations a¹ ₂₅ and a² ₂₅ and mask “1100”.

FIGS. 8A, 8B, 8C and 8D depict weight matrix re-sequencing and converted input data matrix re-sequencing and compression, in accordance with an embodiment of the present disclosure.

FIG. 8A depicts weight matrix re-sequencing and converted input data matrix re-sequencing and compression for output element o¹ ₁ of converted output data matrix 216. Output element o¹ ₁ is equal to the dot product of the first row of converted weight matrix 212, i.e., converted weight set 212 ¹, and the first column of converted input data matrix 214, as follows:

o¹₁ = w¹₁ ⋅ a¹₁ + w¹₂ ⋅ a¹₂ + w¹₃ ⋅ a¹₆ + w¹₄ ⋅ a¹₇ + w¹₅ ⋅ a²₁ + w¹₆ ⋅ a²₂ + w¹₇ ⋅ a²₆ + w¹₈ ⋅ a²₇ + w¹₉ ⋅ a³₁ + w¹₁₀ ⋅ a³₂ + w¹₁₁ ⋅ a³₆ + w¹₁₂ ⋅ a³₇ + w¹₁₃ ⋅ a⁴₁ + w¹₁₄ ⋅ a⁴₂ + w¹₁₅ ⋅ a⁴₆ + w¹₁₆ ⋅ a⁴₇

The dot product of the first row of converted weight matrix 212 and the first column of converted input data matrix 214 may be re-sequenced as follows:

o¹₁ = (w¹₁ ⋅ a¹₁ + w¹₅ ⋅ a²₁ + w¹₉ ⋅ a³₁ + w¹₁₃ ⋅ a⁴₁) + (w¹₂ ⋅ a¹₂ + w¹₆ ⋅ a²₂ + w¹₁₀ ⋅ a³₂ + w¹₁₄ ⋅ a⁴₂) + (w¹₃ ⋅ a¹₆ + w¹₇ ⋅ a²₆ + w¹₁₁ ⋅ a³₆ + w¹₁₅ ⋅ a⁴₆) + (w¹₄ ⋅ a¹₇ + w¹₈ ⋅ a²₇ + w¹₁₂ ⋅ a³₇ + w¹₁₆ ⋅ a⁴₇)

It is clear that the first column of converted input data matrix 214 includes four, interspersed basic block matrices 314, i.e., basic block matrix 314 ₁ (activations a¹ ₁, a² ₁, a³ ₁ and a⁴ ₁), basic block matrix 314 ₂ (activations a¹ ₂, a² ₂, a³ ₂ and a⁴ ₂), basic block matrix 314 ₆ (activations a¹ ₆, a² ₆, a³ ₆ and a⁴ ₆) and basic block matrix 314 ₇ (activations a¹ ₇, a² ₇, a³ ₇ and a⁴ ₇).

Advantageously, the first column of converted input data matrix 214 may be re-sequenced into basic block matrices 314 ₁, 314 ₂, 314 ₆ and 314 ₇, and then basic block matrices 314 ₁, 314 ₂, 314 ₆ and 314 ₇ may be compressed and replaced by compressed basic block matrices 324 ₁, 324 ₂, 324 ₆ and 324 ₇. Converted weight set 212 ¹ may also be re-sequenced to generate four weight groups 322; each weight group 322 corresponds to one of basic block matrices 314 ₁, 314 ₂, 314 ₆ and 314 ₇. The first weight group includes w¹ ₁, w¹ ₅, w¹ ₉ and w¹ ₁₃, the second weight group includes w¹ ₂, w¹ ₆, w¹ ₁₀ and w¹ ₁₄, the third weight group includes w¹ ₃, w¹ ₇, w¹ ₁₁ and w¹ ₁₅, and the fourth weight group includes w¹ ₄, w¹ ₈, w¹ ₁₂ and w¹ ₁₆.
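A minimal numeric sketch of this re-sequencing follows: the 16-element dot product for o¹ ₁ is regrouped into four 4-element partial products, one per basic block matrix, and each partial product is then evaluated from the compressed block (two kept values) plus its mask. The weight and activation values below are made up for illustration; with pruning, the second result only approximates the first unless the pruned activations are exactly zero.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.integers(-3, 4, size=16).astype(np.int32)            # stands in for converted weight set 212^1
blocks = rng.integers(-3, 4, size=(4, 4)).astype(np.int32)   # stands in for blocks 314_1, 314_2, 314_6, 314_7

# Weight groups 322: group g gathers weights w[g], w[g+4], w[g+8], w[g+12].
weight_groups = [w[g::4] for g in range(4)]

def compress(block, k=2):
    keep = np.sort(np.argsort(-np.abs(block))[:k])
    mask = np.zeros(block.size, dtype=np.uint8)
    mask[keep] = 1
    return block[keep], mask

dense = sum(int(weight_groups[g] @ blocks[g]) for g in range(4))      # full dot product
pruned = 0
for g in range(4):
    vals, mask = compress(blocks[g])
    pruned += int(weight_groups[g][mask.astype(bool)] @ vals)         # only kept positions
print(dense, pruned)
```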

FIG. 8B depicts weight matrix re-sequencing and converted input data matrix re-sequencing and compression for output element o² ₂ of converted output data matrix 216. Output element o² ₂ is equal to the dot product of the second row of converted weight matrix 212, i.e., converted weight set 212 ², and the second column of converted input data matrix 214, as follows:

o²₂ = w²₁ ⋅ a¹₂ + w²₂ ⋅ a¹₃ + w²₃ ⋅ a¹₇ + w²₄ ⋅ a¹₈ + w²₅ ⋅ a²₂ + w²₆ ⋅ a²₃ + w²₇ ⋅ a²₇ + w²₈ ⋅ a²₈ + w²₉ ⋅ a³₂ + w²₁₀ ⋅ a³₃ + w²₁₁ ⋅ a³₇ + w²₁₂ ⋅ a³₈ + w²₁₃ ⋅ a⁴₂ + w²₁₄ ⋅ a⁴₃ + w²₁₅ ⋅ a⁴₇ + w²₁₆ ⋅ a⁴₈

The dot product of the second row of converted weight matrix 212 and the second column of converted input data matrix 214 may be re-sequenced as follows:

o²₂ = (w²₁ ⋅ a¹₂ + w²₅ ⋅ a²₂ + w²₉ ⋅ a³₂ + w²₁₃ ⋅ a⁴₂) + (w²₂ ⋅ a¹₃ + w²₆ ⋅ a²₃ + w²₁₀ ⋅ a³₃ + w²₁₄ ⋅ a⁴₃) + (w²₃ ⋅ a¹₇ + w²₇ ⋅ a²₇ + w²₁₁ ⋅ a³₇ + w²₁₅ ⋅ a⁴₇) + (w²₄ ⋅ a¹₈ + w²₈ ⋅ a²₈ + w²₁₂ ⋅ a³₈ + w²₁₆ ⋅ a⁴₈)

It is clear that the second column of converted input data matrix 214 includes four, interspersed basic block matrices 314, i.e., basic block matrix 314 ₂ (activations a¹ ₂, a² ₂, a³ ₂ and a⁴ ₂), basic block matrix 314 ₃ (activations a¹ ₃, a² ₃, a³ ₃ and a⁴ ₃), basic block matrix 314 ₇ (activations a¹ ₇, a² ₇, a³ ₇ and a⁴ ₇) and basic block matrix 314 ₈ (activations a¹ ₈, a² ₈, a³ ₈ and a⁴ ₈).

Advantageously, the second column of converted input data matrix 214 maybe re-sequenced into basic block matrices 314 ₂, 314 ₃, 314 ₇ and 314 ₈,and then basic block matrices 314 ₂, 314 ₃, 314 ₇ and 314 ₈ may becompressed and replaced by compressed basic block matrices 324 ₂, 324 ₃,324 ₇ and 324 ₈. Converted weight set 212 ² may also be re-sequenced togenerate four weight groups 322; each weight group 322 corresponds toone of basic block matrices 314 ₂, 314 ₃, 314 ₇ and 314 ₈. The firstweight group includes w² ₁, w² ₅, w² ₉ and w² ₁₃, the second weightgroup includes w² ₂, w² ₆, w² ₁₀ and w² ₁₄, the third weight groupincludes w² ₃, w² ₇, w² ₁₁ and w² ₁₅, and the fourth weight groupincludes w² ₄, w² ₈, w² ₁₂ and w² ₁₆.

FIG. 8C depicts weight matrix re-sequencing and converted input data matrix re-sequencing and compression for output element o³ ₃ of converted output data matrix 216. Output element o³ ₃ is equal to the dot product of the third row of converted weight matrix 212, i.e., converted weight set 212 ³, and the third column of converted input data matrix 214, as follows:

o³₃ = w³₁ ⋅ a¹₃ + w³₂ ⋅ a¹₄ + w³₃ ⋅ a¹₈ + w³₄ ⋅ a¹₉ + w³₅ ⋅ a²₃ + w³₆ ⋅ a²₄ + w³₇ ⋅ a²₈ + w³₈ ⋅ a²₉ + w³₉ ⋅ a³₃ + w³₁₀ ⋅ a³₄ + w³₁₁ ⋅ a³₈ + w³₁₂ ⋅ a³₉ + w³₁₃ ⋅ a⁴₃ + w³₁₄ ⋅ a⁴₄ + w³₁₅ ⋅ a⁴₈ + w³₁₆ ⋅ a⁴₉

The dot product of the third row of converted weight matrix 212 and the third column of converted input data matrix 214 may be re-sequenced as follows:

o³₃ = (w³₁ ⋅ a¹₃ + w³₅ ⋅ a²₃ + w³₉ ⋅ a³₃ + w³₁₃ ⋅ a⁴₃) + (w³₂ ⋅ a¹₄ + w³₆ ⋅ a²₄ + w³₁₀ ⋅ a³₄ + w³₁₄ ⋅ a⁴₄) + (w³₃ ⋅ a¹₈ + w³₇ ⋅ a²₈ + w³₁₁ ⋅ a³₈ + w³₁₅ ⋅ a⁴₈) + (w³₄ ⋅ a¹₉ + w³₈ ⋅ a²₉ + w³₁₂ ⋅ a³₉ + w³₁₆ ⋅ a⁴₉)

It is clear that the third column of converted input data matrix 214 includes four, interspersed basic block matrices 314, i.e., basic block matrix 314 ₃ (activations a¹ ₃, a² ₃, a³ ₃ and a⁴ ₃), basic block matrix 314 ₄ (activations a¹ ₄, a² ₄, a³ ₄ and a⁴ ₄), basic block matrix 314 ₈ (activations a¹ ₈, a² ₈, a³ ₈ and a⁴ ₈) and basic block matrix 314 ₉ (activations a¹ ₉, a² ₉, a³ ₉ and a⁴ ₉).

Advantageously, the third column of converted input data matrix 214 maybe re-sequenced into basic block matrices 314 ₃, 314 ₄, 314 ₈ and 314 ₉,and then basic block matrices 314 ₃, 314 ₄, 314 ₈ and 314 ₉ may becompressed and replaced by compressed basic block matrices 324 ₃, 324 ₄,324 ₈ and 324 ₉. Converted weight set 212 ³ may also be re-sequenced togenerate four weight groups 322; each weight group 322 corresponds toone of basic block matrices 314 ₃, 314 ₄, 314 ₈ and 314 ₉. The firstweight group includes w³ ₁, w³ ₅, w³ ₉ and w³ ₁₃, the second weightgroup includes w³ ₂, w³ ₆, w³ ₁₀ and w³ ₁₄, the third weight groupincludes w³ ₃, w³ ₇, w³ ₁₁ and w³ ₁₅, and the fourth weight groupincludes w³ ₄, w³ ₈, w³ ₁₂ and w³ ₁₆.

FIG. 8D depicts weight matrix re-sequencing and converted input data matrix re-sequencing and compression for output element o⁴ ₄ of converted output data matrix 216. Output element o⁴ ₄ is equal to the dot product of the fourth row of converted weight matrix 212, i.e., converted weight set 212 ⁴, and the fourth column of converted input data matrix 214, as follows:

o⁴₄ = w⁴₁ ⋅ a¹₄ + w⁴₂ ⋅ a¹₅ + w⁴₃ ⋅ a¹₉ + w⁴₄ ⋅ a¹₁₀ + w⁴₅ ⋅ a²₄ + w⁴₆ ⋅ a²₅ + w⁴₇ ⋅ a²₉ + w⁴₈ ⋅ a²₁₀ + w⁴₉ ⋅ a³₄ + w⁴₁₀ ⋅ a³₅ + w⁴₁₁ ⋅ a³₉ + w⁴₁₂ ⋅ a³₁₀ + w⁴₁₃ ⋅ a⁴₄ + w⁴₁₄ ⋅ a⁴₅ + w⁴₁₅ ⋅ a⁴₉ + w⁴₁₆ ⋅ a⁴₁₀

The dot product of the fourth row of converted weight matrix 212 and the fourth column of converted input data matrix 214 may be re-sequenced as follows:

o⁴₄ = (w⁴₁ ⋅ a¹₄ + w⁴₅ ⋅ a²₄ + w⁴₉ ⋅ a³₄ + w⁴₁₃ ⋅ a⁴₄) + (w⁴₂ ⋅ a¹₅ + w⁴₆ ⋅ a²₅ + w⁴₁₀ ⋅ a³₅ + w⁴₁₄ ⋅ a⁴₅) + (w⁴₃ ⋅ a¹₉ + w⁴₇ ⋅ a²₉ + w⁴₁₁ ⋅ a³₉ + w⁴₁₅ ⋅ a⁴₉) + (w⁴₄ ⋅ a¹₁₀ + w⁴₈ ⋅ a²₁₀ + w⁴₁₂ ⋅ a³₁₀ + w⁴₁₆ ⋅ a⁴₁₀)

It is clear that the fourth column of converted input data matrix 214 includes four, interspersed basic block matrices 314, i.e., basic block matrix 314 ₄ (activations a¹ ₄, a² ₄, a³ ₄ and a⁴ ₄), basic block matrix 314 ₅ (activations a¹ ₅, a² ₅, a³ ₅ and a⁴ ₅), basic block matrix 314 ₉ (activations a¹ ₉, a² ₉, a³ ₉ and a⁴ ₉) and basic block matrix 314 ₁₀ (activations a¹ ₁₀, a² ₁₀, a³ ₁₀ and a⁴ ₁₀).

Advantageously, the fourth column of converted input data matrix 214 maybe re-sequenced into basic block matrices 314 ₄, 314 ₅, 314 ₉ and 314₁₀, and then basic block matrices 314 ₄, 314 ₅, 314 ₉ and 314 ₁₀ may becompressed and replaced by compressed basic block matrices 324 ₄, 324 ₅,324 ₉ and 324 ₁₀. Converted weight set 212 ⁴ may also be re-sequenced togenerate four weight groups 322; each weight group 322 corresponds toone of basic block matrices 314 ₄, 314 ₅, 314 ₉ and 314 ₁₀. The firstweight group includes w⁴ ₁, w⁴ ₅, w⁴ ₉ and w⁴ ₁₃, the second weightgroup includes w⁴ ₂, w⁴ ₆, w⁴ ₁₀ and w⁴ ₁₄, the third weight groupincludes w⁴ ₃, w⁴ ₇, w⁴ ₁₁ and w⁴ ₁₅, and the fourth weight groupincludes w⁴ ₄, w⁴ ₈, w⁴ ₁₂ and w⁴ ₁₆.

This re-sequencing and compression may be performed in software or hardware.

FIG. 9 depicts a data flow diagram 360 for DAP MAC array 318, in accordance with an embodiment of the present disclosure.

Each DAP MAC (DM) unit calculates a dot product, between a row of converted weight matrix 212 and a column of converted input data matrix 214, to generate an element of converted output data matrix 216. Generally, the rows from converted weight matrix 212 are read from local memory, re-sequenced, and provided to DAP MAC array 318 as a number of weight groups 322. Similarly, the columns from converted input data matrix 214 are read from local memory, re-sequenced and compressed, and provided to DAP MAC array 318 as a number of compressed basic block matrices 324. Each DAP MAC unit includes, inter alia, a data selection circuit, two or more multiplexers, two or more multipliers, an adder and a storage register, and is reset by clearing or zeroing its storage register prior to, or at the start of, a new dot product calculation.
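One processing cycle of such a unit can be sketched in software as below. This is a minimal behavioral model, not the disclosed circuit: the mask drives the selection of the two matching weights from a 4-weight group, the two products are formed, and the sum is accumulated into the storage register. Function and variable names are illustrative.

```python
# Minimal sketch of one DAP MAC processing cycle: mask -> weight selection ->
# two multiplies -> accumulate into the storage register.
def dap_mac_cycle(acc, weight_group, compressed_vals, mask):
    # Positions of the two surviving activations within the basic block matrix.
    selected = [i for i, bit in enumerate(mask) if bit == 1]
    for val, pos in zip(compressed_vals, selected):
        acc += weight_group[pos] * val      # one multiplier per kept activation
    return acc

# Mirrors cycle 1 of DM1 in the walk-through below: weights (w1_1, w1_5, w1_9, w1_13),
# compressed block (a1_1, a3_1), mask "1010" -> selects w1_1 and w1_9.
acc = 0
acc = dap_mac_cycle(acc, weight_group=[2, -1, 3, 4], compressed_vals=[5, 7], mask=[1, 0, 1, 0])
print(acc)   # 2*5 + 3*7 = 31
```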

In this embodiment, each row of converted weight matrix 212 is re-sequenced and then divided into a number of weight groups, as discussed above. Each weight group from each row is simultaneously provided to all of the DAP MAC units in one column during a different processing cycle.

More particularly, the re-sequenced weight groups of the first row of converted weight matrix 212 (i.e., converted weight set 212 ¹) are provided to the first column of DAP MAC units in DAP MAC array 318 (i.e., DM₁, DM₅, DM₉ and DM₁₃). The re-sequenced weight groups of the second row of converted weight matrix 212 (i.e., converted weight set 212 ²) are provided to the second column of DAP MAC units in DAP MAC array 318 (i.e., DM₂, DM₆, DM₁₀ and DM₁₄). The re-sequenced weight groups of the third row of converted weight matrix 212 (i.e., converted weight set 212 ³) are provided to the third column of DAP MAC units in DAP MAC array 318 (i.e., DM₃, DM₇, DM₁₁ and DM₁₅). The re-sequenced weight groups of the fourth row of converted weight matrix 212 (i.e., converted weight set 212 ⁴) are provided to the fourth column of DAP MAC units in DAP MAC array 318 (i.e., DM₄, DM₈, DM₁₂ and DM₁₆).

Similarly, each column of converted input data matrix 214 is re-sequenced and compressed into a number of compressed basic block matrices 324, as discussed above. Each compressed basic block matrix 324 from each column is simultaneously provided to all of the DAP MAC units in one row during a different processing cycle.

More particularly, the compressed basic block matrices 324 of the first column of converted input data matrix 214 are provided to the first row of DAP MAC units in DAP MAC array 318 (i.e., DM₁, DM₂, DM₃ and DM₄). The compressed basic block matrices 324 of the 5^(th), 9^(th) and 13^(th) columns are also provided to the first row of DAP MAC units during later processing cycles. The compressed basic block matrices 324 of the second column of converted input data matrix 214 are provided to the second row of DAP MAC units in DAP MAC array 318 (i.e., DM₅, DM₆, DM₇ and DM₈). The compressed basic block matrices 324 of the 6^(th), 10^(th) and 14^(th) columns are also provided to the second row of DAP MAC units during later processing cycles. The compressed basic block matrices 324 of the third column of converted input data matrix 214 are provided to the third row of DAP MAC units in DAP MAC array 318 (i.e., DM₉, DM₁₀, DM₁₁ and DM₁₂). The compressed basic block matrices 324 of the 7^(th), 11^(th) and 15^(th) columns are also provided to the third row of DAP MAC units during later processing cycles. The compressed basic block matrices 324 of the fourth column of converted input data matrix 214 are provided to the fourth row of DAP MAC units in DAP MAC array 318 (i.e., DM₁₃, DM₁₄, DM₁₅ and DM₁₆). The compressed basic block matrices 324 of the 8^(th), 12^(th) and 16^(th) columns are also provided to the fourth row of DAP MAC units during later processing cycles.

In another embodiment, the weight groups and compressed basic block matrices 324 may be provided to DAP MAC array 318 using delay registers or flip flops similar to those described for MAC array 218. While the principle of operation is the same, the number of delay registers for the rows of DAP MAC array 318 is doubled, i.e., two delay registers for the second row, four delay registers for the third row and six delay registers for the fourth row, due to the presentation of two activation elements per compressed basic block matrix during each processing cycle. Similarly, the number of delay registers for the columns of DAP MAC array 318 is quadrupled, i.e., four delay registers for the second column, eight delay registers for the third column and 12 delay registers for the fourth column, due to the presentation of four weights per weight group during each processing cycle. Larger weight group sizes and compressed basic block matrix dimensions are also contemplated, which would require additional delay registers.

The first row of DAP MAC array 318 includes DAP MAC units DM₁, DM₂, DM₃ and DM₄. The dot product calculations performed by DM₁ for the blocks of the first quadrants a¹ _(q1), a² _(q1), a³ _(q1) and a⁴ _(q1) of converted input data matrix 214 are discussed in detail below, while the dot product calculations performed by DM₂, DM₃ and DM₄ are summarized below. DM₁ calculates the dot product of the first row of converted weight matrix 212 (i.e., converted weight set 212 ¹) and the first column of converted input data matrix 214 to generate element o¹ ₁ of converted output data matrix 216.

During processing cycle 1, DM₁ receives w¹ ₁, w¹ ₅, w¹ ₉, w¹ ₁₃, a¹ ₁, a³ ₁ and mask “1010” from local memory, selects w¹ ₁ and w¹ ₉ based on mask “1010”, multiplies w¹ ₁ and a¹ ₁ to generate a first intermediate product, multiplies w¹ ₉ and a³ ₁ to generate a second intermediate product, adds the first and second intermediate products to the value stored in the storage register (i.e., 0), and stores the accumulated result back in the storage register.

During processing cycle 2, DM₁ receives w¹ ₂, w¹ ₆, w¹ ₁₀, w¹ ₁₄, a² ₂, a⁴ ₂ and mask “0101” from local memory, selects w¹ ₆ and w¹ ₁₄ based on mask “0101”, multiplies w¹ ₆ and a² ₂ to generate a first intermediate product, multiplies w¹ ₁₄ and a⁴ ₂ to generate a second intermediate product, adds the first and second intermediate products to the value stored in the storage register, and stores the accumulated result back in the storage register.

During processing cycle 3, DM₁ receives w¹ ₃, w¹ ₇, w¹ ₁₁, w¹ ₁₅, a² ₆, a⁴ ₆ and mask “0101” from local memory, selects w¹ ₇ and w¹ ₁₅ based on mask “0101”, multiplies w¹ ₇ and a² ₆ to generate a first intermediate product, multiplies w¹ ₁₅ and a⁴ ₆ to generate a second intermediate product, adds the first and second intermediate products to the value stored in the storage register, and stores the accumulated result back in the storage register.

During processing cycle 4, DM₁ receives w¹ ₄, w¹ ₈, w¹ ₁₂, w¹ ₁₆, a¹ ₇, a⁴ ₇ and mask “1001” from local memory, selects w¹ ₄ and w¹ ₁₆ based on mask “1001”, multiplies w¹ ₄ and a¹ ₇ to generate a first intermediate product, multiplies w¹ ₁₆ and a⁴ ₇ to generate a second intermediate product, adds the first and second intermediate products to the value stored in the storage register, and stores the accumulated result back in the storage register. At the end of processing cycle 4, DM₁ outputs element o¹ ₁.

The remainder of the first row of DAP MAC array 318 includes DM₂, DM₃and DM₄. These DAP MAC units operate in the same manner as DM₁ butreceive weight groups from the second, third and fourth rows ofconverted weight matrix 212, respectively.

The second row of DAP MAC array 318 includes DAP MAC units DM₅, DM₆, DM₇and DM₈. The dot product calculations performed by DM₅ for the blocks ofthe first quadrants a¹ _(q1), a² _(q1), a³ _(q1) and a⁴ _(q1) ofconverted input data matrix 214 are discussed in detail below, while thedot product calculations performed by DM₆, DM₇ and DM₈ are summarizedbelow. DM₅ calculates the dot product of the first row of convertedweight matrix 212 (i.e., converted weight set 212 ¹) and the secondcolumn of converted input data matrix 214 to generate element o¹ ₂ ofconverted output data matrix 216.

During processing cycle 1, DM₅ receives w¹ ₁, w¹ ₅, w¹ ₉, w¹ ₁₃, a² ₂, a⁴ ₂ and mask “0101” from local memory, selects w¹ ₅ and w¹ ₁₃ based on mask “0101”, multiplies w¹ ₅ and a² ₂ to generate a first intermediate product, multiplies w¹ ₁₃ and a⁴ ₂ to generate a second intermediate product, adds the first and second intermediate products to the value stored in the storage register (i.e., 0), and stores the accumulated result back in the storage register.

During processing cycle 2, DM₅ receives w¹ ₂, w¹ ₆, w¹ ₁₀, w¹ ₁₄, a¹ ₃, a⁴ ₃ and mask “1001” from local memory, selects w¹ ₂ and w¹ ₁₄ based on mask “1001”, multiplies w¹ ₂ and a¹ ₃ to generate a first intermediate product, multiplies w¹ ₁₄ and a⁴ ₃ to generate a second intermediate product, adds the first and second intermediate products to the value stored in the storage register, and stores the accumulated result back in the storage register.

During processing cycle 3, DM₅ receives w¹ ₃, w¹ ₇, w¹ ₁₁, w¹ ₁₅, a¹ ₇, a⁴ ₇ and mask “1001” from local memory, selects w¹ ₃ and w¹ ₁₅ based on mask “1001”, multiplies w¹ ₃ and a¹ ₇ to generate a first intermediate product, multiplies w¹ ₁₅ and a⁴ ₇ to generate a second intermediate product, adds the first and second intermediate products to the value stored in the storage register, and stores the accumulated result back in the storage register.

During processing cycle 4, DM₅ receives w¹ ₄, w¹ ₈, w¹ ₁₂, w¹ ₁₆, a¹ ₈, a² ₈ and mask “1100” from local memory, selects w¹ ₄ and w¹ ₈ based on mask “1100”, multiplies w¹ ₄ and a¹ ₈ to generate a first intermediate product, multiplies w¹ ₈ and a² ₈ to generate a second intermediate product, adds the first and second intermediate products to the value stored in the storage register, and stores the accumulated result back in the storage register. At the end of processing cycle 4, DM₅ outputs element o¹ ₂.

The remainder of the second row of DAP MAC array 318 includes DM₆, DM₇and DM₈. These DAP MAC units operate in the same manner as DM₅ butreceive weight groups from the second, third and fourth rows ofconverted weight matrix 212, respectively.

The third row of DAP MAC array 318 includes DAP MAC units DM₉, DM₁₀,DM₁₁ and DM₁₂. The dot product calculations performed by DM₉ for theblocks of the first quadrants a¹ _(q1), a² _(q1), a³ _(q1) and a⁴ _(q1)of converted input data matrix 214 are discussed in detail below, whilethe dot product calculations performed by DM₁₀, DM₁₁ and DM₁₂ aresummarized below. DM₉ calculates the dot product of the first row ofconverted weight matrix 212 (i.e., converted weight set 212 ¹) and thethird column of converted input data matrix 214 to generate element o¹ ₃of converted output data matrix 216.

During processing cycle 1, DM₉ receives w¹ ₁, w¹ ₅, w¹ ₉, w¹ ₁₃, a¹ ₃, a⁴ ₃ and mask “1001” from local memory, selects w¹ ₁ and w¹ ₁₃ based on mask “1001”, multiplies w¹ ₁ and a¹ ₃ to generate a first intermediate product, multiplies w¹ ₁₃ and a⁴ ₃ to generate a second intermediate product, adds the first and second intermediate products to the value stored in the storage register (i.e., 0), and stores the accumulated result back in the storage register.

During processing cycle 2, DM₉ receives w¹ ₂, w¹ ₆, w¹ ₁₀, w¹ ₁₄, a¹ ₄, a² ₄ and mask “1100” from local memory, selects w¹ ₂ and w¹ ₆ based on mask “1100”, multiplies w¹ ₂ and a¹ ₄ to generate a first intermediate product, multiplies w¹ ₆ and a² ₄ to generate a second intermediate product, adds the first and second intermediate products to the value stored in the storage register, and stores the accumulated result back in the storage register.

During processing cycle 3, DM₉ receives w¹ ₃, w¹ ₇, w¹ ₁₁, w¹ ₁₅, a¹ ₈, a² ₈ and mask “1100” from local memory, selects w¹ ₃ and w¹ ₇ based on mask “1100”, multiplies w¹ ₃ and a¹ ₈ to generate a first intermediate product, multiplies w¹ ₇ and a² ₈ to generate a second intermediate product, adds the first and second intermediate products to the value stored in the storage register, and stores the accumulated result back in the storage register.

During processing cycle 4, DM₉ receives w¹ ₄, w¹ ₈, w¹ ₁₂, w¹ ₁₆, a² ₉, a³ ₉ and mask “0110” from local memory, selects w¹ ₈ and w¹ ₁₂ based on mask “0110”, multiplies w¹ ₈ and a² ₉ to generate a first intermediate product, multiplies w¹ ₁₂ and a³ ₉ to generate a second intermediate product, adds the first and second intermediate products to the value stored in the storage register, and stores the accumulated result back in the storage register. At the end of processing cycle 4, DM₉ outputs element o¹ ₃.

The remainder of the third row of DAP MAC array 318 includes DM₁₀, DM₁₁and DM₁₂. These DAP MAC units operate in the same manner as DM₉ butreceive weight groups from the second, third and fourth rows ofconverted weight matrix 212, respectively.

The fourth row of DAP MAC array 318 includes DAP MAC units DM₁₃, DM₁₄,DM₁₅ and DM₁₆. The dot product calculations performed by DM₁₃ for theblocks of the first quadrants a¹ _(q1), a² _(q1), a³ _(q1) and a⁴ _(q1)of converted input data matrix 214 are discussed in detail below, whilethe dot product calculations performed by DM₁₄, DM₁₅ and DM₁₆ aresummarized below. DM₁₃ calculates the dot product of the first row ofconverted weight matrix 212 (i.e., converted weight set 212 ¹) and thefourth column of converted input data matrix 214 to generate element o¹₄ of converted output data matrix 216.

During processing cycle 1, DM₁₃ receives w¹ ₁, w¹ ₅, w¹ ₉, w¹ ₁₃, a¹ ₄, a² ₄ and mask “1100” from local memory, selects w¹ ₁ and w¹ ₅ based on mask “1100”, multiplies w¹ ₁ and a¹ ₄ to generate a first intermediate product, multiplies w¹ ₅ and a² ₄ to generate a second intermediate product, adds the first and second intermediate products to the value stored in the storage register (i.e., 0), and stores the accumulated result back in the storage register.

During processing cycle 2, DM₁₃ receives w¹ ₂, w¹ ₆, w¹ ₁₀, w¹ ₁₄, a² ₅, a³ ₅ and mask “0110” from local memory, selects w¹ ₆ and w¹ ₁₀ based on mask “0110”, multiplies w¹ ₆ and a² ₅ to generate a first intermediate product, multiplies w¹ ₁₀ and a³ ₅ to generate a second intermediate product, adds the first and second intermediate products to the value stored in the storage register, and stores the accumulated result back in the storage register.

During processing cycle 3, DM₁₃ receives w¹ ₃, w¹ ₇, w¹ ₁₁, w¹ ₁₅, a² ₉, a³ ₉ and mask “0110” from local memory, selects w¹ ₇ and w¹ ₁₁ based on mask “0110”, multiplies w¹ ₇ and a² ₉ to generate a first intermediate product, multiplies w¹ ₁₁ and a³ ₉ to generate a second intermediate product, adds the first and second intermediate products to the value stored in the storage register, and stores the accumulated result back in the storage register.

During processing cycle 4, DM₁₃ receives w¹ ₄, w¹ ₈, w¹ ₁₂, w¹ ₁₆, a¹ ₁₀, a³ ₁₀ and mask “1010” from local memory, selects w¹ ₄ and w¹ ₁₂ based on mask “1010”, multiplies w¹ ₄ and a¹ ₁₀ to generate a first intermediate product, multiplies w¹ ₁₂ and a³ ₁₀ to generate a second intermediate product, adds the first and second intermediate products to the value stored in the storage register, and stores the accumulated result back in the storage register. At the end of processing cycle 4, DM₁₃ outputs element o¹ ₄.

The remainder of the fourth row of DAP MAC array 318 includes DM₁₄, DM₁₅ and DM₁₆. These DAP MAC units operate in the same manner as DM₁₃ but receive weight groups from the second, third and fourth rows of converted weight matrix 212, respectively.

After the blocks of the first quadrants a¹ _(q1), a² _(q1), a³ _(q1) and a⁴ _(q1) of converted input data matrix 214 have been processed, the next sequence of operations processes the blocks of the second quadrants a¹ _(q2), a² _(q2), a³ _(q2) and a⁴ _(q2). After the blocks of the second quadrants a¹ _(q2), a² _(q2), a³ _(q2) and a⁴ _(q2) have been processed, the next sequence of operations processes the blocks of the third quadrants a¹ _(q3), a² _(q3), a³ _(q3) and a⁴ _(q3). And, after the blocks of the third quadrants a¹ _(q3), a² _(q3), a³ _(q3) and a⁴ _(q3) have been processed, the final sequence of operations processes the blocks of the fourth quadrants a¹ _(q4), a² _(q4), a³ _(q4) and a⁴ _(q4). Converted weight matrix 212 is accessed for each sequence of operations.

Advantageously, compressed basic block matrices 324 may retain theircompressed form in the datapath. In another embodiment, converted weightmatrix 212 may be compressed in a similar manner, and each compressedweight set may include a mask to identify weights with non-zero values.

FIG. 10 depicts a block diagram of system 100, in accordance with anembodiment of the present disclosure.

Computer 102 includes bus 110 coupled to one or more processors 120,memory 130, I/O interfaces 140, display interface 150, one or morecommunication interfaces 160 and one or more MMAs 400. Generally, I/Ointerfaces 140 are coupled to I/O devices 142 using a wired or wirelessconnection, display interface 150 is coupled to display 152, andcommunication interface 160 is connected to network 162 using a wired orwireless connection.

Bus 110 is a communication system that transfers data between processor 120, memory 130, I/O interfaces 140, display interface 150, communication interface 160 and MMA 400, as well as other components not depicted in FIG. 10. Power connector 112 is coupled to bus 110 and a power supply (not shown).

Processor 120 includes one or more general-purpose or application-specific microprocessors that execute instructions to perform control, computation, input/output, etc. functions for computer 102. Processor 120 may include a single integrated circuit, such as a micro-processing device, or multiple integrated circuit devices and/or circuit boards working in cooperation to accomplish the functions of processor 120. In addition, processor 120 may execute computer programs or modules, such as operating system 132, software modules 134, etc., stored within memory 130. For example, software modules 134 may include an ML application, an ANN application, a CNN application, etc.

Generally, storage element or memory 130 stores instructions for execution by processor 120 and data. Memory 130 may include a variety of non-transitory computer-readable media that may be accessed by processor 120. In various embodiments, memory 130 may include volatile and nonvolatile media, non-removable media and/or removable media. For example, memory 130 may include any combination of random access memory (RAM), dynamic RAM (DRAM), static RAM (SRAM), read only memory (ROM), flash memory, cache memory, and/or any other type of non-transitory computer-readable medium.

Memory 130 contains various components for retrieving, presenting,modifying, and storing data. For example, memory 130 stores softwaremodules that provide functionality when executed by processor 120. Thesoftware modules include operating system 132 that provides operatingsystem functionality for computer 102. Software modules 134 providevarious functionality, such as image classification using convolutionalneural networks, etc. Data 136 may include data associated withoperating system 132, software modules 134, etc.

I/O interfaces 140 are configured to transmit and/or receive data fromI/O devices 142. I/O interfaces 140 enable connectivity betweenprocessor 120 and I/O devices 142 by encoding data to be sent fromprocessor 120 to I/O devices 142, and decoding data received from I/Odevices 142 for processor 120. Generally, data may be sent over wiredand/or wireless connections. For example, I/O interfaces 140 may includeone or more wired communications interfaces, such as USB, Ethernet,etc., and/or one or more wireless communications interfaces, coupled toone or more antennas, such as WiFi, Bluetooth, cellular, etc.

Generally, I/O devices 142 provide input to computer 102 and/or outputfrom computer 102. As discussed above, I/O devices 142 are operablyconnected to computer 102 using a wired and/or wireless connection. I/Odevices 142 may include a local processor coupled to a communicationinterface that is configured to communicate with computer 102 using thewired and/or wireless connection. For example, I/O devices 142 mayinclude a keyboard, mouse, touch pad, joystick, etc.

Display interface 150 is configured to transmit image data from computer102 to monitor or display 152.

Communication interface 160 is configured to transmit data to and fromnetwork 162 using one or more wired and/or wireless connections. Network162 may include one or more local area networks, wide area networks, theInternet, etc., which may execute various network protocols, such as,for example, wired and/or wireless Ethernet, Bluetooth, etc. Network 162may also include various combinations of wired and/or wireless physicallayers, such as, for example, copper wire or coaxial cable networks,fiber optic networks, Bluetooth wireless networks, WiFi wirelessnetworks, CDMA, FDMA and TDMA cellular wireless networks, etc.

MMA 400 is configured to multiply matrices and generate output matricesto support various applications implemented by software modules 134.

FIG. 11 depicts a block diagram of MMA 400, in accordance with embodiments of the present disclosure.

MMA 400 includes PE array 402, I/O interface 410, register 420, register430 and register 440.

In this embodiment, PE array 402 includes 16 PEs 450 arranged in a 4×4 array; other numbers of PEs 450 and arrangements are also contemplated, such as, for example, four PEs 450 arranged in a 2×2 array, nine PEs 450 arranged in a 3×3 array, 25 PEs 450 arranged in a 5×5 array, 36 PEs 450 arranged in a 6×6 array, 49 PEs 450 arranged in a 7×7 array, 64 PEs 450 arranged in an 8×8 array, etc. Non-symmetric arrangements, such as a 2×3 array, a 3×4 array, a 4×5 array, a 4×6 array, etc., may be advantageous for certain applications. Each PE 450 is coupled to register 420, register 430 and register 440, and calculates a dot product for one element of converted output data matrix 216.

For example, the PE 450 located in the first row and the first column(i.e., upper left corner) of PE array 402 calculates the dot products ofthe 1^(st) row of converted weight matrix 212 and the 1^(st), 5^(th),9^(th) and 13^(th) columns of converted input data matrix 214 togenerate the o¹ ₁, o¹ ₅, o¹ ₉ and o¹ ₁₃ elements of converted outputdata matrix 216, as discussed above with respect to DAP MAC unit DM₁.

I/O interface 410 is coupled to bus 110, register 420, register 430 andregister 440. I/O interface 410 includes a microcontroller that sendsdata to, and receives data and commands from, processor 120, memory 130,etc. The microcontroller implements a set of instructions that controlsthe data flow and the operation of PEs 450.

In some embodiments, a dedicated controller, microcontroller, field programmable gate array (FPGA), etc., may control the data flow and the operation of MMA 400. For example, the controller may implement load/store (L/S) instructions, memory mapped I/O (MMIO), direct memory access (DMA), etc., to load the compressed matrices and corresponding masks into register 420 and the weights into register 430, start the matrix multiply operation, read back the output matrix from register 440, etc. More particularly, one or more software modules 134, executing on processor 120, may calculate the masks and compress the matrices, send these data and the appropriate commands to MMA 400 to upload registers 420 and 430, start the matrix multiply operation, read back the results from register 440, etc.
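The host-side sequence described above can be sketched as follows. Every MMA-facing call (load_register, start, read_register) is a hypothetical placeholder for whatever L/S, MMIO or DMA mechanism the controller actually exposes; none of these names come from the disclosure, and the pruning step mirrors the DAP sketch shown earlier.

```python
import numpy as np

# Minimal host-side sketch: compute masks, compress activations, upload the
# registers, start the multiply, and read back the output. The `mma` object
# and its methods are assumed placeholders, not an actual driver API.
def run_layer(mma, weights, ifm_blocks, sparsity=0.5):
    masks, compressed = [], []
    for block in ifm_blocks:                                   # software computes masks and compresses
        k = int(block.size * (1.0 - sparsity))
        keep = np.sort(np.argsort(-np.abs(block))[:k])
        mask = np.zeros(block.size, dtype=np.uint8)
        mask[keep] = 1
        masks.append(mask)
        compressed.append(block[keep])
    mma.load_register(420, values=compressed, masks=masks)     # compressed activations + masks
    mma.load_register(430, values=weights)                     # re-sequenced weight groups
    mma.start()                                                # start the matrix multiply operation
    return mma.read_register(440)                              # converted output data matrix
```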

Register 420 includes vector register 422 and scalar register 424. Vector register 422 stores the elements of the compressed matrices in the multiplication operation, such as compressed basic block matrices 324. Scalar register 424 stores the masks associated with the compressed matrices in the multiplication operation. In this embodiment, scalar register 424 is 32 bits wide, and vector register 422 is 8 elements wide, each element being the same size as the data contained within compressed basic block matrices 324, such as, for example, 8 bit integer data, 16 bit integer data, 32 bit integer data, 16 bit floating point data, 16 bit Bfloat data, 32 bit floating point data, etc.

Generally, vector register 422 simultaneously provides “m” elements to each row of PEs 450 in PE array 402. In this embodiment, vector register 422 simultaneously provides two elements to each row of PEs 450 in PE array 402. In certain embodiments, vector register 422 may be divided into four individual vector registers, while in other embodiments, vector register 422 may be a single register.

Register 430 includes vector register 432. Vector register 432 stores the elements of the other matrix in the multiplication operation, such as converted weight matrix 212. In this embodiment, vector register 432 is 16 elements wide, each element being the same size as the data contained within converted weight matrix 212, such as, for example, 8 bit integer data, 16 bit integer data, 32 bit integer data, 16 bit floating point data, 16 bit Bfloat data, 32 bit floating point data, etc.

Generally, vector register 432 simultaneously provides “n” elements to each column of PEs 450 in PE array 402. In this embodiment, vector register 432 simultaneously provides four elements to each column of PEs 450 in PE array 402. In certain embodiments, vector register 432 may be divided into four individual vector registers, one for each row of converted weight matrix 212, while in other embodiments, vector register 432 may be a single register.

Register 440 includes vector register 442, which stores the elements of the output matrix in the multiplication operation, such as converted output data matrix 216. In this embodiment, vector register 442 is 16 elements wide, each element being the same size as the data contained within converted output data matrix 216, such as, for example, 8 bit integer data, 16 bit integer data, 32 bit integer data, 16 bit floating point data, 16 bit Bfloat data, 32 bit floating point data, etc. Vector registers 422, 432 and 442 all use the same element size, such as, for example, 8 bit integer data, etc.

FIG. 12 depicts a block diagram of PE 450 for MMA 400, in accordance with an embodiment of the present disclosure. PE 450 represents one embodiment of a DAP MAC unit; other embodiments are also contemplated.

PE 450 includes data selection circuit 451, data selectors or multiplexers 452¹, 452², . . . , 452^(m), multiplier circuits 454¹, 454², . . . , 454^(m), and accumulator circuit 456 that includes adder circuit 457 and accumulator register 458. For the embodiment depicted in FIGS. 9, 10 and 11, “m” equals 2; other configurations are also contemplated. For clarity of discussion, PE 450 will be described with respect to this embodiment.

Data selection circuit 451 is coupled to scalar register 424 via a set of parallel data lines, and receives a mask or index, associated with a compressed basic block matrix 324, from scalar register 424. Data selection circuit 451 is configured to decode the mask by applying a bitmask, performing a bit shift operation, etc., and to generate and send appropriate selection signals to multiplexers 452¹ and 452².
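
A simple, non-limiting Python sketch of the mask decoding performed by data selection circuit 451 follows; it assumes the leftmost mask bit corresponds to the first weight of the weight group, as in the processing-cycle example below, and the function name decode_mask is illustrative.

    # Illustrative sketch of mask decoding for a PE with m = 2 multipliers:
    # the set bits of the mask identify which weights of the 4-element
    # weight group the two multiplexers should select.
    def decode_mask(mask, m=2):
        selected = [i for i, bit in enumerate(mask) if bit == "1"]
        if len(selected) != m:
            raise ValueError("mask must select exactly m weights")
        return selected

    print(decode_mask("1010"))   # [0, 2] -> first and third weights of the group
    print(decode_mask("1001"))   # [0, 3] -> first and fourth weights of the group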

Multiplexers 452¹ and 452² are coupled to vector register 432 via “n” sets of parallel data lines. The number of parallel data line sets, “n,” is equal to the number of weights in each weight group 322. In this embodiment, there are 4 weights in each weight group 322 and 4 sets of parallel data lines, so “n” equals 4. Each parallel data line set transfers one weight element from vector register 432 to each multiplexer 452¹ and 452². The number of parallel data lines in each set is equal to the size of the element in vector register 432, such as 8 for 8 bit integer data, 16 for 16 bit integer data, etc., as discussed above.

Multiplexers 452¹ and 452² are coupled to data selection circuit 451 via individual selection signal lines. Each selection signal line transmits a selection signal that commands the particular multiplexer 452^(i) to select a respective set of parallel data lines to output to an associated multiplier circuit 454^(i). In other words, each multiplexer 452^(i) selects an element from vector register 432, and provides that element to the associated multiplier circuit 454^(i). In this embodiment, multiplexer 452¹ is coupled to multiplier circuit 454¹ via a set of parallel data lines, and multiplexer 452² is coupled to multiplier circuit 454² via a set of parallel data lines.

Multiplier circuits 454¹ and 454² are coupled to vector register 422 via respective sets of parallel data lines. Each parallel data line set transfers one element of a compressed basic block matrix from vector register 422 to multiplier circuit 454¹ or 454². The number of parallel data lines in each set is equal to the size of the element in vector register 422, such as 8 for 8 bit integer data, 16 for 16 bit integer data, etc., as discussed above. Multiplier circuits 454¹ and 454² are also coupled to accumulator circuit 456 via respective sets of parallel data lines.

Multiplier circuit 454¹ multiplies the data value, m_a, provided by vector register 422, and the data value, m_w, provided by associated multiplexer 452¹, and outputs the resulting data value or intermediate product, ip₁, to accumulator circuit 456. The data values m_a, m_w and ip₁ have the same size, such as, for example, 8 bit integer, etc. Similarly, multiplier circuit 454² multiplies the data value, m_a, provided by vector register 422, and the data value, m_w, provided by associated multiplexer 452², and outputs the resulting data value or intermediate product, ip₂, to accumulator circuit 456.

Accumulator circuit 456 includes adder circuit 457 and accumulator register 458. Generally, adder circuit 457 adds the intermediate products ip₁ and ip₂ from multiplier circuits 454¹ and 454², and outputs the resulting data value to accumulator register 458. During each processing cycle, data selection circuit 451 performs a selection cycle, multiplier circuits 454¹ and 454² perform simultaneous multiply cycles, and accumulator circuit 456 then performs an add cycle. During the selection cycle, data selection circuit 451 determines, based on the mask or index, which elements of the weight group 322 correspond to the elements of the compressed basic block matrix 324, and generates and sends the appropriate selection signals to multiplexers 452¹ and 452², which select the appropriate parallel data lines. For this embodiment, the dot product calculation is complete after four processing cycles, and accumulator circuit 456 then outputs the data value to the corresponding element of vector register 442.
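
The select/multiply/accumulate behavior of a single PE over the four processing cycles may be sketched in Python as follows; this is a behavioral illustration only, the variable names are hypothetical, and the hardware timing (simultaneous multiply cycles, pipelined add cycle) is not modeled.

    # Illustrative behavioral model of one PE (m = 2): for each processing
    # cycle, decode the mask (selection cycle), form two intermediate
    # products (multiply cycles) and add them into the accumulator (add cycle).
    def pe_dot_product(weight_groups, compressed_pairs, masks):
        acc = 0
        for weights, (a1, a2), mask in zip(weight_groups, compressed_pairs, masks):
            sel = [i for i, bit in enumerate(mask) if bit == "1"]
            ip1 = weights[sel[0]] * a1
            ip2 = weights[sel[1]] * a2
            acc += ip1 + ip2
        return acc

    # Four cycles, each with a 4-weight group, two compressed elements and a mask.
    print(pe_dot_product(
        weight_groups=[[1, 2, 3, 4]] * 4,
        compressed_pairs=[(1, 1)] * 4,
        masks=["1010", "0101", "0101", "1001"]))   # (1+3)+(2+4)+(2+4)+(1+4) = 21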

More particularly, and in the interest of brevity, a single dot product calculation cycle will be described using the PE 450 located in the first row and the first column (i.e., upper left corner) of PE array 402. This particular PE 450 corresponds to DAP MAC unit DM₁ in FIG. 9.

During processing cycle 1, multiplexers 452¹ and 452² each receive w¹₁, w¹₅, w¹₉, w¹₁₃ from vector register 432, multiplier circuit 454¹ receives a¹₁ from vector register 422, multiplier circuit 454² receives a³₁ from vector register 422, and data selection circuit 451 receives mask “1010” from scalar register 424. Data selection circuit 451 determines that the first element, w¹₁, should be selected by multiplexer 452¹ based on the first bit of the mask being set to 1, determines that the third element, w¹₉, should be selected by multiplexer 452² based on the third bit of the mask being set to 1, and then generates and sends the appropriate selection signals to multiplexers 452¹ and 452².

Multiplexer 452¹ receives a selection signal from data selection circuit 451, selects and outputs the first element, w¹₁, to multiplier circuit 454¹. Multiplier circuit 454¹ multiplies w¹₁ and a¹₁ to generate intermediate product ip₁, and sends intermediate product ip₁ to adder circuit 457. Multiplexer 452² receives a selection signal from data selection circuit 451, selects and outputs the third element, w¹₉, to multiplier circuit 454². Multiplier circuit 454² multiplies w¹₉ and a³₁ to generate intermediate product ip₂, and sends intermediate product ip₂ to adder circuit 457. Adder circuit 457 adds intermediate products ip₁, ip₂ and the value stored in accumulator register 458 (i.e., 0), and then outputs the resulting data value to accumulator register 458.

During processing cycle 2, multiplexers 452¹ and 452² each receive w¹₂, w¹₆, w¹₁₀, w¹₁₄ from vector register 432, multiplier circuit 454¹ receives a²₂ from vector register 422, multiplier circuit 454² receives a⁴₂ from vector register 422, and data selection circuit 451 receives mask “0101” from scalar register 424. Data selection circuit 451 determines that the second element, w¹₆, should be selected by multiplexer 452¹ based on the second bit of the mask being set to 1, determines that the fourth element, w¹₁₄, should be selected by multiplexer 452² based on the fourth bit of the mask being set to 1, and then generates and sends the appropriate selection signals to multiplexers 452¹ and 452².

Multiplexer 452¹ receives a selection signal from data selection circuit 451, selects and outputs the second element, w¹₆, to multiplier circuit 454¹. Multiplier circuit 454¹ multiplies w¹₆ and a²₂ to generate intermediate product ip₁, and sends intermediate product ip₁ to adder circuit 457. Multiplexer 452² receives a selection signal from data selection circuit 451, selects and outputs the fourth element, w¹₁₄, to multiplier circuit 454². Multiplier circuit 454² multiplies w¹₁₄ and a⁴₂ to generate intermediate product ip₂, and sends intermediate product ip₂ to adder circuit 457. Adder circuit 457 adds intermediate products ip₁, ip₂ and the value stored in accumulator register 458, and then outputs the resulting data value to accumulator register 458.

During processing cycle 3, multiplexers 452¹ and 452² each receive w¹₃, w¹₇, w¹₁₁, w¹₁₅ from vector register 432, multiplier circuit 454¹ receives a²₆ from vector register 422, multiplier circuit 454² receives a⁴₆ from vector register 422, and data selection circuit 451 receives mask “0101” from scalar register 424. Data selection circuit 451 determines that the second element, w¹₇, should be selected by multiplexer 452¹ based on the second bit of the mask being set to 1, determines that the fourth element, w¹₁₅, should be selected by multiplexer 452² based on the fourth bit of the mask being set to 1, and then generates and sends the appropriate selection signals to multiplexers 452¹ and 452².

Multiplexer 452¹ receives a selection signal from data selection circuit 451, selects and outputs the second element, w¹₇, to multiplier circuit 454¹. Multiplier circuit 454¹ multiplies w¹₇ and a²₆ to generate intermediate product ip₁, and sends intermediate product ip₁ to adder circuit 457. Multiplexer 452² receives a selection signal from data selection circuit 451, selects and outputs the fourth element, w¹₁₅, to multiplier circuit 454². Multiplier circuit 454² multiplies w¹₁₅ and a⁴₆ to generate intermediate product ip₂, and sends intermediate product ip₂ to adder circuit 457. Adder circuit 457 adds intermediate products ip₁, ip₂ and the value stored in accumulator register 458, and then outputs the resulting data value to accumulator register 458.

During processing cycle 4, multiplexers 452¹ and 452² each receive w¹₄, w¹₈, w¹₁₂, w¹₁₆ from vector register 432, multiplier circuit 454¹ receives a¹₇ from vector register 422, multiplier circuit 454² receives a⁴₇ from vector register 422, and data selection circuit 451 receives mask “1001” from scalar register 424. Data selection circuit 451 determines that the first element, w¹₄, should be selected by multiplexer 452¹ based on the first bit of the mask being set to 1, determines that the fourth element, w¹₁₆, should be selected by multiplexer 452² based on the fourth bit of the mask being set to 1, and then generates and sends the appropriate selection signals to multiplexers 452¹ and 452².

Multiplexer 452¹ receives a selection signal from data selection circuit 451, selects and outputs the first element, w¹₄, to multiplier circuit 454¹. Multiplier circuit 454¹ multiplies w¹₄ and a¹₇ to generate intermediate product ip₁, and sends intermediate product ip₁ to adder circuit 457. Multiplexer 452² receives a selection signal from data selection circuit 451, selects and outputs the fourth element, w¹₁₆, to multiplier circuit 454². Multiplier circuit 454² multiplies w¹₁₆ and a⁴₇ to generate intermediate product ip₂, and sends intermediate product ip₂ to adder circuit 457. Adder circuit 457 adds intermediate products ip₁, ip₂ and the value stored in accumulator register 458, and then outputs the resulting data value to accumulator register 458.

At the end of processing cycle 4, accumulator register 458 outputs the stored data value to vector register 442.
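
The weight/activation pairings produced over the four processing cycles described above can be reproduced with the following non-limiting Python sketch; the element labels are strings standing in for the contents of vector registers 432 and 422, and only the selection logic is modeled.

    # Illustrative trace of the four processing cycles for the upper left PE:
    # each tuple holds (weight group, compressed activation pair, mask).
    cycles = [
        (["w1_1", "w1_5", "w1_9",  "w1_13"], ("a1_1", "a3_1"), "1010"),
        (["w1_2", "w1_6", "w1_10", "w1_14"], ("a2_2", "a4_2"), "0101"),
        (["w1_3", "w1_7", "w1_11", "w1_15"], ("a2_6", "a4_6"), "0101"),
        (["w1_4", "w1_8", "w1_12", "w1_16"], ("a1_7", "a4_7"), "1001"),
    ]
    for weights, (a1, a2), mask in cycles:
        i, j = [k for k, bit in enumerate(mask) if bit == "1"]
        print(f"ip1 = {weights[i]} * {a1},  ip2 = {weights[j]} * {a2}")
    # Cycle 1 prints: ip1 = w1_1 * a1_1,  ip2 = w1_9 * a3_1
    # Cycle 4 prints: ip1 = w1_4 * a1_7,  ip2 = w1_16 * a4_7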

FIG. 13 depicts a flow diagram 500 representing functionality associated with multiplying matrices, in accordance with embodiments of the present disclosure.

At 510, a number of basic block matrices are generated based on an input tensor. Each basic block matrix includes a number of elements.

The functionality at 520, 530 and 540 is repeated for each basic block matrix.

At 520, the elements of the basic block matrix are pruned based on a sparsity value.

At 530, a mask is generated for the basic block matrix. The mask includes a number of bits. Each bit corresponds to a different element of the basic block matrix.

At 540, the basic block matrix is compressed to generate a compressed basic block matrix that has fewer elements than the basic block matrix.

At 550, the compressed basic block matrices and a weight matrix are multiplied, based on the masks, to generate an output matrix.
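
A non-limiting Python sketch of the flow at 510 through 550 follows; it uses NumPy, models each basic block as a 1×b row vector, and assumes that pruning keeps the largest-magnitude elements so that roughly the sparsity-value fraction of elements is removed. The function names and the exact rounding of the keep count are illustrative assumptions.

    import numpy as np

    # Illustrative sketch of 520-540 for one basic block: prune by magnitude
    # according to the sparsity value, build the bit mask (1 = element present
    # in the compressed block), and compress.
    def prune_mask_compress(block, sparsity):
        b = block.size
        keep = max(1, int(round(b * (1.0 - sparsity))))          # assumed keep count
        kept_idx = np.sort(np.argsort(np.abs(block))[-keep:])    # largest magnitudes
        mask = np.zeros(b, dtype=np.uint8)
        mask[kept_idx] = 1
        return mask, block[kept_idx]

    # Illustrative sketch of 550 for one block/weight-group pair: only the
    # weights whose mask bit is set participate in the product.
    def masked_dot(weight_group, mask, compressed):
        return float(weight_group[mask.astype(bool)] @ compressed)

    block = np.array([0.1, -2.0, 0.0, 3.5])
    mask, comp = prune_mask_compress(block, sparsity=0.5)
    print(mask, comp)                                              # [0 1 0 1] [-2.  3.5]
    print(masked_dot(np.array([1.0, 2.0, 3.0, 4.0]), mask, comp))  # -2*2 + 3.5*4 = 10.0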

Embodiments of the present disclosure advantageously provide a system and method for multiplying matrices that significantly reduce “multiply by zero” conditions. Embodiments of the present disclosure are applicable to the multiplication of a dense matrix with a “sparse” matrix, and many embodiments accommodate any degree of sparsity. The embodiments described above and summarized below are combinable.

In one embodiment, a system includes a processor coupled to a memory, and a matrix multiply accelerator (MMA) coupled to the processor and the memory. The processor is configured to generate, based on an input tensor, a number of basic block matrices, each basic block matrix including a number of elements; for each basic block matrix: prune, based on a sparsity value, the elements of the basic block matrix, generate a mask for the basic block matrix, each mask including a number of bits, each bit corresponding to a different element of the basic block matrix, and compress the basic block matrix to generate a compressed basic block matrix having fewer elements than the basic block matrix. The MMA is configured to multiply, based on the masks, the compressed basic block matrices and a weight matrix to generate an output matrix.

In another embodiment of the system, the sparsity value is a number greater than 0 and less than 1; and each bit in each mask has a value of 1 when the corresponding element of the basic block matrix is present within the compressed basic block matrix, and a value of 0 when the corresponding element of the basic block matrix is not present within the compressed basic block matrix.

In another embodiment of the system, the input tensor includes a number of input matrices, each input matrix has a same number of elements, and the number of input matrices is equal to a number of channels; the number of basic block matrices is equal to the number of input matrix elements, and each basic block matrix has a first dimension that is equal to 1 and a second dimension that is equal to a hyperparameter b; and each compressed basic block matrix has a first dimension that is equal to 1 and a second dimension that is less than the hyperparameter b.

In another embodiment of the system, the processor is further configured to flatten a number of weight tensors to generate the weight matrix, where each weight tensor has a first dimension, a second dimension, and a third dimension that is equal to the number of channels, and where each flattened weight tensor forms a row of the weight matrix.
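
The flattening of weight tensors into rows of the weight matrix can be illustrated with the following non-limiting NumPy sketch; the (first dimension, second dimension, channels) layout and the row-major flattening order are assumptions made only for illustration.

    import numpy as np

    # Illustrative sketch: each weight tensor (first dim x second dim x channels)
    # is flattened into one row of the weight matrix.
    def flatten_weight_tensors(weight_tensors):
        return np.stack([w.reshape(-1) for w in weight_tensors], axis=0)

    filters = [np.random.randn(2, 2, 4) for _ in range(3)]   # 3 hypothetical filters
    weight_matrix = flatten_weight_tensors(filters)
    print(weight_matrix.shape)   # (3, 16): one row per flattened weight tensor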

In another embodiment of the system, the processor is further configured to convert the input tensor into a converted input data matrix based on a convolution operation, re-sequence each column of the converted input data matrix into a sequence of compressed basic block matrices, and re-sequence each row of the weight matrix into a sequence of weight groups based on the sequences of compressed basic block matrices; and the MMA is further configured to multiply, based on the masks, the sequences of compressed basic block matrices and the sequences of weight groups to generate the output matrix.
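
The re-sequencing of a column of the converted input data matrix into basic blocks, and of a row of the weight matrix into weight groups, can be illustrated with the following non-limiting sketch; splitting into consecutive chunks of b elements is an assumption for illustration, as is the accumulation of the per-block partial products into one output element.

    import numpy as np

    # Illustrative sketch: one column of the converted input data matrix is
    # split into 1 x b basic blocks, and the matching row of the weight matrix
    # is split into weight groups of b elements; block k pairs with weight
    # group k, and the partial dot products accumulate into one output element.
    def split_into_chunks(vector, b):
        return [vector[i:i + b] for i in range(0, vector.size, b)]

    column = np.arange(16.0)      # hypothetical column of the converted input data matrix
    row = np.arange(16.0, 32.0)   # hypothetical row of the weight matrix
    blocks = split_into_chunks(column, b=4)
    groups = split_into_chunks(row, b=4)
    print(sum(float(blk @ grp) for blk, grp in zip(blocks, groups)))   # one output element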

In another embodiment of the system, the MMA includes a first register configured to store the masks and the compressed basic block matrices; a second register configured to store the weight groups; a third register configured to store the output matrix; and an array of processing elements (PEs), coupled to the first, second and third registers, each PE configured to multiply, based on the masks, one sequence of compressed basic block matrices and one sequence of weight groups to generate one element of the output matrix.

In another embodiment of the system, each PE includes a first multiplexer configured to receive a weight group within the sequence of weight groups, and selectively output a first weight based on a first data selection signal; a second multiplexer configured to receive the weight group within the sequence of weight groups, and selectively output a second weight based on a second data selection signal; a data selection circuit, coupled to the first multiplexer and the second multiplexer, configured to receive the mask corresponding to a compressed basic block matrix within the sequence of compressed basic block matrices, generate the first data selection signal based on the mask, and generate the second data selection signal based on the mask; a first multiplier circuit, coupled to the first multiplexer, configured to receive a first element from the compressed basic block matrix and the first weight selectively output by the first multiplexer, multiply the first element and the first weight to generate a first intermediate product, and output the first intermediate product; a second multiplier circuit, coupled to the second multiplexer, configured to receive a second element from the compressed basic block matrix and the second weight selectively output by the second multiplexer, multiply the second element and the second weight to generate a second intermediate product, and output the second intermediate product; and an accumulator circuit, coupled to the first and second multiplier circuits, configured to receive the first and second intermediate products, and accumulate the first and second intermediate products into a value for one element of the output matrix.

In another embodiment of the system, each weight group includes at least four weights, each mask includes at least four bits, and each compressed basic block matrix includes at least two elements.

In another embodiment of the system, the system further includes a dynamic activation pruning (DAP) selection circuit, coupled to the processor, configured to generate the mask for each basic block matrix, the DAP selection circuit including a plurality of input registers, each input register including an enable bit and configured to store one basic block matrix element; a chain of magnitude compactors, coupled to the input registers, each magnitude compactor configured to receive two basic block matrix elements, determine which basic block matrix element has a greater magnitude, and then pass the basic block matrix element with the greater magnitude to a next magnitude compactor in the chain of magnitude compactors; and an output register including a plurality of output bits, each output bit corresponding to one of the input registers, where a final magnitude compactor in the chain of magnitude compactors is configured to set the respective output bit within the output register, and where the input register enable bit indicates whether the input register transmits or does not transmit the stored basic block matrix element to the chain of magnitude compactors.
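
The behavior of the DAP selection circuit can be approximated with the following non-limiting Python sketch; it interprets the comparator chain as repeatedly finding the largest-magnitude enabled element, setting its output bit, and clearing its enable bit until k bits are set. This iterative top-k interpretation, and the function name dap_select, are assumptions for illustration.

    # Illustrative model of the DAP selection circuit: each pass, the chain of
    # magnitude compactors passes the larger-magnitude element forward, the
    # final compactor sets the winner's output bit, and the winner's enable
    # bit is cleared before the next pass.
    def dap_select(block, k):
        enable = [True] * len(block)
        mask = [0] * len(block)
        for _ in range(k):
            winner = None
            for i, value in enumerate(block):
                if not enable[i]:
                    continue                          # disabled registers do not compete
                if winner is None or abs(value) > abs(block[winner]):
                    winner = i
            mask[winner] = 1                          # output bit set by final compactor
            enable[winner] = False
        return mask

    print(dap_select([0.1, -2.0, 0.0, 3.5], k=2))     # [0, 1, 0, 1]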

In one embodiment, a computer-based method for multiplying matrices includes generating, based on an input tensor, a number of basic block matrices, each basic block matrix including a number of elements; for each basic block matrix: pruning, based on a sparsity value, the elements of the basic block matrix, generating a mask for the basic block matrix, the mask including a number of bits, each bit corresponding to a different element of the basic block matrix, and compressing the basic block matrix to generate a compressed basic block matrix having fewer elements than the basic block matrix; and multiplying, based on the masks, the compressed basic block matrices and a weight matrix to generate an output matrix.

In another embodiment of the computer-based method, the sparsity value is a number greater than 0 and less than 1; and each bit in each mask has a value of 1 when the corresponding element of the basic block matrix is present within the compressed basic block matrix, and a value of 0 when the corresponding element of the basic block matrix is not present within the compressed basic block matrix.

In another embodiment of the computer-based method, the input tensor includes a number of input matrices, each input matrix has a same number of elements, and the number of input matrices is equal to a number of channels; the number of basic block matrices is equal to the number of input matrix elements, and each basic block matrix has a first dimension that is equal to 1 and a second dimension that is equal to a hyperparameter b; and each compressed basic block matrix has a first dimension that is equal to 1 and a second dimension that is less than the hyperparameter b.

In another embodiment of the computer-based method, the method further includes flattening a number of weight tensors to generate the weight matrix, where each weight tensor has a first dimension, a second dimension, and a third dimension that is equal to the number of channels, and where each flattened weight tensor forms a row of the weight matrix.

In another embodiment of the computer-based method, the method further includes converting the input tensor into a converted input data matrix based on a convolution operation; re-sequencing each column of the converted input data matrix into a sequence of compressed basic block matrices; re-sequencing each row of the weight matrix into a sequence of weight groups based on the sequences of compressed basic block matrices; and multiplying, based on the masks, the sequences of compressed basic block matrices and the sequences of weight groups to generate the output matrix.

In another embodiment of the computer-based method, the multiplying is performed by an array of processing elements (PEs), and each PE is configured to multiply, based on the masks, one sequence of compressed basic block matrices and one sequence of weight groups to generate one element of the output matrix.

In another embodiment of the computer-based method, each PE includes a first multiplexer configured to receive a weight group within the sequence of weight groups, and selectively output a first weight based on a first data selection signal; a second multiplexer configured to receive the weight group within the sequence of weight groups, and selectively output a second weight based on a second data selection signal; a data selection circuit, coupled to the first multiplexer and the second multiplexer, configured to receive the mask corresponding to a compressed basic block matrix within the sequence of compressed basic block matrices, generate the first data selection signal based on the mask, and generate the second data selection signal based on the mask; a first multiplier circuit, coupled to the first multiplexer, configured to receive a first element from the compressed basic block matrix and the first weight selectively output by the first multiplexer, multiply the first element and the first weight to generate a first intermediate product, and output the first intermediate product; a second multiplier circuit, coupled to the second multiplexer, configured to receive a second element from the compressed basic block matrix and the second weight selectively output by the second multiplexer, multiply the second element and the second weight to generate a second intermediate product, and output the second intermediate product; and an accumulator circuit, coupled to the first and second multiplier circuits, configured to receive the first and second intermediate products, and accumulate the first and second intermediate products into a value for one element of the output matrix.

In another embodiment of the computer-based method, each weight group includes at least four weights, each mask includes at least four bits, and each compressed basic block matrix includes at least two elements.

In one embodiment, a computer-based method for training a convolutional neural network (CNN) includes, during a forward phase: providing a filter to a convolutional layer, the filter including a number of weight sets; providing input feature maps to a dynamic activation pruning (DAP) process; at the DAP process: generating, based on the input feature maps, a number of basic block matrices, each basic block matrix including a number of elements, pruning, based on a sparsity value, the elements of each basic block matrix, generating a mask for each basic block matrix, the mask including a number of bits, each bit corresponding to a different element of the basic block matrix, compressing each basic block matrix to generate a compressed basic block matrix having fewer elements than the basic block matrix, and providing the masks and the compressed basic block matrices to the convolutional layer; at the convolutional layer: multiplying, based on the masks, the compressed basic block matrices and the weight sets to generate output feature maps; and during a backward phase: backpropagating gradients; and updating the weight sets.
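
A non-limiting NumPy sketch of the forward-phase computation follows; it assumes that the input feature maps have already been converted to an im2col-style matrix, treats the DAP process as per-block magnitude pruning, and omits the backward phase. All function names, shapes and the pruning rule are illustrative assumptions.

    import numpy as np

    # Illustrative forward-phase sketch: prune/compress each 1 x b block of
    # every im2col column (the DAP process), then multiply the surviving
    # elements against the flattened weight sets (the convolutional layer).
    def dap_forward(columns, weight_matrix, b, sparsity):
        F, K = weight_matrix.shape
        _, N = columns.shape
        out = np.zeros((F, N))
        for n in range(N):
            for start in range(0, K, b):
                block = columns[start:start + b, n]
                keep = max(1, int(round(block.size * (1.0 - sparsity))))
                kept = np.sort(np.argsort(np.abs(block))[-keep:])   # mask generation
                out[:, n] += weight_matrix[:, start + kept] @ block[kept]
        return out

    cols = np.random.randn(16, 9)    # hypothetical im2col matrix (K = 16, N = 9)
    W = np.random.randn(3, 16)       # 3 hypothetical flattened weight sets
    print(dap_forward(cols, W, b=4, sparsity=0.5).shape)   # (3, 9) output feature map data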

In another embodiment of the computer-based training method, the sparsity value is a number greater than 0 and less than 1; and each bit in each mask has a value of 1 when the corresponding element of the basic block matrix is present within the compressed basic block matrix, and a value of 0 when the corresponding element of the basic block matrix is not present within the compressed basic block matrix.

In another embodiment of the computer-based training method, the method further includes, at the convolutional layer: flattening the weight sets to generate a weight matrix, each flattened weight set forming a row of the weight matrix; converting the input feature maps into a converted input data matrix based on a convolution operation; re-sequencing each column of the converted input data matrix into a sequence of compressed basic block matrices; re-sequencing each row of the weight matrix into a sequence of weight groups based on the sequences of compressed basic block matrices; multiplying, based on the masks, the sequences of compressed basic block matrices and the sequences of weight groups to generate an output data matrix; and generating the output feature maps based on the output data matrix.

While implementations of the disclosure are susceptible to embodiment in many different forms, there is shown in the drawings and will herein be described in detail specific embodiments, with the understanding that the present disclosure is to be considered as an example of the principles of the disclosure and not intended to limit the disclosure to the specific embodiments shown and described. In the description above, like reference numerals may be used to describe the same, similar or corresponding parts in the several views of the drawings.

In this document, relational terms such as first and second, top and bottom, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms “comprises,” “comprising,” “includes,” “including,” “has,” “having,” or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. An element preceded by “comprises . . . a” does not, without more constraints, preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises the element.

Reference throughout this document to “one embodiment,” “certain embodiments,” “an embodiment,” “implementation(s),” “aspect(s),” or similar terms means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, the appearances of such phrases in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments without limitation.

The term “or” as used herein is to be interpreted as an inclusive or meaning any one or any combination. Therefore, “A, B or C” means “any of the following: A; B; C; A and B; A and C; B and C; A, B and C.” An exception to this definition will occur only when a combination of elements, functions, steps or acts are in some way inherently mutually exclusive. Also, grammatical conjunctions are intended to express any and all disjunctive and conjunctive combinations of conjoined clauses, sentences, words, and the like, unless otherwise stated or clear from the context. Thus, the term “or” should generally be understood to mean “and/or” and so forth. References to items in the singular should be understood to include items in the plural, and vice versa, unless explicitly stated otherwise or clear from the text.

Recitation of ranges of values herein is not intended to be limiting, referring instead individually to any and all values falling within the range, unless otherwise indicated, and each separate value within such a range is incorporated into the specification as if it were individually recited herein. The words “about,” “approximately,” or the like, when accompanying a numerical value, are to be construed as indicating a deviation as would be appreciated by one of ordinary skill in the art to operate satisfactorily for an intended purpose. Ranges of values and/or numeric values are provided herein as examples only, and do not constitute a limitation on the scope of the described embodiments. The use of any and all examples, or exemplary language (“e.g.,” “such as,” “for example,” or the like) provided herein, is intended merely to better illuminate the embodiments and does not pose a limitation on the scope of the embodiments. No language in the specification should be construed as indicating any unclaimed element as essential to the practice of the embodiments.

For simplicity and clarity of illustration, reference numerals may be repeated among the figures to indicate corresponding or analogous elements. Numerous details are set forth to provide an understanding of the embodiments described herein. The embodiments may be practiced without these details. In other instances, well-known methods, procedures, and components have not been described in detail to avoid obscuring the embodiments described. The description is not to be considered as limited to the scope of the embodiments described herein.

In the following description, it is understood that terms such as “first,” “second,” “top,” “bottom,” “up,” “down,” “above,” “below,” and the like, are words of convenience and are not to be construed as limiting terms. Also, the terms apparatus, device, system, etc. may be used interchangeably in this text.

The many features and advantages of the disclosure are apparent from the detailed specification, and, thus, it is intended by the appended claims to cover all such features and advantages of the disclosure which fall within the scope of the disclosure. Further, since numerous modifications and variations will readily occur to those skilled in the art, it is not desired to limit the disclosure to the exact construction and operation illustrated and described, and, accordingly, all suitable modifications and equivalents may be resorted to that fall within the scope of the disclosure.

What is claimed is:
 1. A system, comprising: a processor, coupled to a memory, configured to: generate, based on an input tensor, a number of basic block matrices, each basic block matrix including a number of elements, for each basic block matrix: prune, based on a sparsity value, the elements of the basic block matrix, generate a mask for the basic block matrix, each mask including a number of bits, each bit corresponding to a different element of the basic block matrix, and compress the basic block matrix to generate a compressed basic block matrix having fewer elements than the basic block matrix; and a matrix multiply accelerator (MMA), coupled to the processor and the memory, configured to: multiply, based on the masks, the compressed basic block matrices and a weight matrix to generate an output matrix.
 2. The system according to claim 1, where: the sparsity value is a number greater than 0 and less than 1; and each bit in each mask has a value of 1 when the corresponding element of the basic block matrix is present within the compressed basic block matrix, and a value of 0 when the corresponding element of the basic block matrix is not present within the compressed basic block matrix.
 3. The system according to claim 2, where: the input tensor includes a number of input matrices, each input matrix has a same number of elements, and the number of input matrices is equal to a number of channels; the number of basic block matrices is equal to the number of input matrix elements, and each basic block matrix has a first dimension that is equal to 1 and a second dimension that is equal to a hyperparameter b; and each compressed basic block matrix has a first dimension that is equal to 1 and a second dimension that is less than the hyperparameter b.
 4. The system according to claim 3, where the processor is further configured to: flatten a number of weight tensors to generate the weight matrix, where each weight tensor has a first dimension, a second dimension, and a third dimension that is equal to the number of channels, and where each flattened weight tensor forms a row of the weight matrix.
 5. The system according to claim 1, where: the processor is further configured to: convert the input tensor into a converted input data matrix based on a convolution operation, re-sequence each column of the converted input data matrix into a sequence of compressed basic block matrices, and re-sequence each row of the weight matrix into a sequence of weight groups based on the sequences of compressed basic block matrices; and the MMA is further configured to: multiply, based on the masks, the sequences of compressed basic block matrices and the sequences of weight groups to generate the output matrix.
 6. The system according to claim 5, where the MMA includes: a first register configured to store the masks and the compressed basic block matrices; a second register configured to store the weight groups; a third register configured to store the output matrix; and an array of processing elements (PEs), coupled to the first, second and third registers, each PE configured to multiply, based on the masks, one sequence of compressed basic block matrices and one sequence of weight groups to generate one element of the output matrix.
 7. The system according to claim 6, where each PE includes: a first multiplexer configured to receive a weight group within the sequence of weight groups, and selectively output a first weight based on a first data selection signal; a second multiplexer configured to receive the weight group within the sequence of weight groups, and selectively output a second weight based on a second data selection signal; a data selection circuit, coupled to the first multiplexer and the second multiplexer, configured to receive the mask corresponding to a compressed basic block matrix within the sequence of compressed basic block matrices, generate the first data selection signal based on the mask, and generate the second data selection signal based on the mask; a first multiplier circuit, coupled to the first multiplexer, configured to receive a first element from the compressed basic block matrix and the first weight selectively output by the first multiplexer, multiply the first element and the first weight to generate a first intermediate product, and output the first intermediate product; a second multiplier circuit, coupled to the second multiplexer, configured to receive a second element from the compressed basic block matrix and the second weight selectively output by the second multiplexer, multiply the second element and the second weight to generate a second intermediate product, and output the second intermediate product; and an accumulator circuit, coupled to the first and second multiplier circuits, configured to receive the first and second intermediate products, and accumulate the first and second intermediate products into a value for one element of the output matrix.
 8. The system according to claim 7, where each weight group includes at least four weights, each mask includes at least four bits, and each compressed basic block matrix includes at least two elements.
 9. The system according to claim 1, further comprising a dynamic activation pruning (DAP) selection circuit, coupled to the processor, configured to generate the mask for each basic block matrix, the DAP selection circuit including: a plurality of input registers, each input register including an enable bit and configured to store one basic block matrix element; a chain of magnitude compactors, coupled to the input registers, each magnitude compactor configured to receive two basic block matrix elements, determine which basic block matrix element has a greater magnitude, and then pass the basic block matrix element with the greater magnitude to a next magnitude compactor in the chain of magnitude compactors; and an output register including a plurality of output bits, each output bit corresponding to one of the input registers, where a final magnitude compactor in the chain of magnitude compactors is configured to set the respective output bit within the output register, and where the input register enable bit indicates whether the input register transmits or does not transmit the stored basic block matrix element to the chain of magnitude compactors.
 10. A computer-based method for multiplying matrices, comprising: generating, based on an input tensor, a number of basic block matrices, each basic block matrix including a number of elements; for each basic block matrix: pruning, based on a sparsity value, the elements of the basic block matrix, generating a mask for the basic block matrix, the mask including a number of bits, each bit corresponding to a different element of the basic block matrix, and compressing the basic block matrix to generate a compressed basic block matrix having fewer elements than the basic block matrix; and multiplying, based on the masks, the compressed basic block matrices and a weight matrix to generate an output matrix.
 11. The computer-based method according to claim 10, where: the sparsity value is a number greater than 0 and less than 1; and each bit in each mask has a value of 1 when the corresponding element of the basic block matrix is present within the compressed basic block matrix, and a value of 0 when the corresponding element of the basic block matrix is not present within the compressed basic block matrix.
 12. The computer-based method according to claim 11, where: the input tensor includes a number of input matrices, each input matrix has a same number of elements, and the number of input matrices is equal to a number of channels; the number of basic block matrices is equal to the number of input matrix elements, and each basic block matrix has a first dimension that is equal to 1 and a second dimension that is equal to a hyperparameter b; and each compressed basic block matrix has a first dimension that is equal to 1 and a second dimension that is less than the hyperparameter b.
 13. The computer-based method according to claim 12, further comprising: flattening a number of weight tensors to generate the weight matrix, where each weight tensor has a first dimension, a second dimension, and a third dimension that is equal to the number of channels, and where each flattened weight tensor forms a row of the weight matrix.
 14. The computer-based method according to claim 10, further comprising: converting the input tensor into a converted input data matrix based on a convolution operation; re-sequencing each column of the converted input data matrix into a sequence of compressed basic block matrices; re-sequencing each row of the weight matrix into a sequence of weight groups based on the sequences of compressed basic block matrices; and multiplying, based on the masks, the sequences of compressed basic block matrices and the sequences of weight groups to generate the output matrix.
 15. The computer-based method according to claim 14, where said multiplying is performed by an array of processing elements (PEs), each PE is configured to multiply, based on the masks, one sequence of compressed basic block matrices and one sequence of weight groups to generate one element of the output matrix.
 16. The computer-based method according to claim 15, where each PE includes: a first multiplexer configured to receive a weight group within the sequence of weight groups, and selectively output a first weight based on a first data selection signal; a second multiplexer configured to receive the weight group within the sequence of weight groups, and selectively output a second weight based on a second data selection signal; a data selection circuit, coupled to the first multiplexer and the second multiplexer, configured to receive the mask corresponding to a compressed basic block matrix within the sequence of compressed basic block matrices, generate the first data selection signal based on the mask, and generate the second data selection signal based on the mask; a first multiplier circuit, coupled to the first multiplexer, configured to receive a first element from the compressed basic block matrix and the first weight selectively output by the first multiplexer, multiply the first element and the first weight to generate a first intermediate product, and output the first intermediate product; a second multiplier circuit, coupled to the second multiplexer, configured to receive a second element from the compressed basic block matrix and the second weight selectively output by the second multiplexer, multiply the second element and the second weight to generate a second intermediate product, and output the second intermediate product; and an accumulator circuit, coupled to the first and second multiplier circuits, configured to receive the first and second intermediate products, and accumulate the first and second intermediate products into a value for one element of the output matrix.
 17. The computer-based method according to claim 16, where each weight group includes at least four weights, each mask includes at least four bits, and each compressed basic block matrix includes at least two elements.
 18. A computer-based method for training a convolutional neural network (CNN), comprising: during a forward phase: providing a filter to a convolutional layer, the filter including a number of weight sets; providing input feature maps to a dynamic activation pruning (DAP) process; at the DAP process: generating, based on the input feature maps, a number of basic block matrices, each basic block matrix including a number of elements, pruning, based on a sparsity value, the elements of each basic block matrix, generating a mask for each basic block matrix, the mask including a number of bits, each bit corresponding to a different element of the basic block matrix, compressing each basic block matrix to generate a compressed basic block matrix having fewer elements than the basic block matrix, and providing the masks and the compressed basic block matrices to the convolutional layer; at the convolutional layer: multiplying, based on the masks, the compressed basic block matrices and the weight sets to generate output feature maps; during a backward phase: backpropagating gradients; and updating the weight sets.
 19. The computer-based method according to claim 18, where the sparsity value is a number greater than 0 and less than 1; and each bit in each mask has a value of 1 when the corresponding element of the basic block matrix is present within the compressed basic block matrix, and a value of 0 when the corresponding element of the basic block matrix is not present within the compressed basic block matrix.
 20. The computer-based method according to claim 19, further comprising: at the convolutional layer: flattening the weight sets to generate a weight matrix, each flattened weight set forming a row of the weight matrix; converting the input feature maps into a converted input data matrix based on a convolution operation; re-sequencing each column of the converted input data matrix into a sequence of compressed basic block matrices; re-sequencing each row of the weight matrix into a sequence of weight groups based on the sequences of compressed basic block matrices; multiplying, based on the masks, the sequences of compressed basic block matrices and the sequences of weight groups to generate an output data matrix; and generating the output feature maps based on the output data matrix.